[SPARK-48656][CORE] Do a length check and throw COLLECTION_SIZE_LIMIT_EXCEEDED error in CartesianRDD.getPartitions
#47019
Conversation
@timlee0119 also opened a PR for this at #47016. Compared to that other PR, this PR has unit tests and also uses the new structured error framework / error class, so I lean towards taking this PR's approach.
@@ -53,11 +54,16 @@ class CartesianRDD[T: ClassTag, U: ClassTag](
  extends RDD[(T, U)](sc, Nil)
  with Serializable {

val numPartitionsInRdd1 = rdd1.partitions.length
This change means that `rdd1.partitions` will possibly be computed more eagerly than it would have been before, which is potentially an internal behavior change w.r.t. the old behavior. In principle this should not matter, but if we ever wanted to consider backporting this change then I would weakly incline towards keeping the existing call pattern, just for the sake of minimizing the scope of changes.
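The eagerness concern above can be illustrated in isolation. This is a hypothetical sketch (the class names are illustrative, not Spark's): hoisting a computation into a constructor-time `val` forces it when the instance is built, while a `lazy val` preserves on-demand evaluation.

```scala
// Illustrative only: shows why moving a call into a `val` field changes
// *when* the work happens, not just where the code lives.
class EagerHolder(compute: () => Int) {
  val n: Int = compute() // runs at construction time
}

class LazyHolder(compute: () => Int) {
  lazy val n: Int = compute() // runs only on first access to `n`
}
```

A backport-friendly alternative, per the comment, is simply to keep calling `rdd1.partitions.length` at the original call sites.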
@JoshRosen Thank you for your thoughtful advice, I will fix it.
LGTM! I'll wait a day or so to see if anyone else has feedback, otherwise I'll loop back and merge this tomorrow. Thanks for working on this.
val numPartitionsInRdd1 = rdd1.partitions.length
val partitionNum: Long = numPartitionsInRdd1.toLong * numPartitionsInRdd2.toLong
if (partitionNum > Int.MaxValue) {
  throw SparkCoreErrors.cartesianPartitionNumOverflow(numPartitionsInRdd1, numPartitionsInRdd2)
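The guard in the diff above can be sketched standalone. This is a simplified stand-in, not the PR's actual code: multiplying two `Int` partition counts can silently overflow, so the product is computed in `Long` and compared against `Int.MaxValue` before narrowing; a plain exception stands in for `SparkCoreErrors.cartesianPartitionNumOverflow`.

```scala
// Simplified sketch of the overflow check (not Spark's actual code).
def checkedPartitionCount(numPartitions1: Int, numPartitions2: Int): Int = {
  // Widen to Long first: an Int * Int product can wrap around silently.
  val product: Long = numPartitions1.toLong * numPartitions2.toLong
  if (product > Int.MaxValue) {
    // The PR throws a structured COLLECTION_SIZE_LIMIT_EXCEEDED error here.
    throw new IllegalArgumentException(
      s"Cartesian product of $numPartitions1 x $numPartitions2 partitions exceeds Int.MaxValue")
  }
  product.toInt
}
```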
I prefer to reuse COLLECTION_SIZE_LIMIT_EXCEEDED
@yaooqinn Thank you for your suggestion, let me change it.
See org.apache.spark.sql.errors.QueryExecutionErrors.tooManyArrayElementsError
@yaooqinn This method is defined in
+1, LGTM
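For readers unfamiliar with the error-class discussion above, here is a loose sketch of what a structured error with an error class and message parameters looks like. The class and helper names below are illustrative, not Spark's actual API; they only mimic the shape of an error built from an error class plus parameters.

```scala
// Illustrative stand-in for a structured error: an error class identifier
// plus named message parameters, rendered into the exception message.
class StructuredException(
    val errorClass: String,
    val messageParameters: Map[String, String])
  extends RuntimeException(
    s"[$errorClass] " +
      messageParameters.map { case (k, v) => s"$k=$v" }.mkString(", "))

// Hypothetical helper in the spirit of the PR's error factory.
def cartesianTooLarge(n1: Int, n2: Int): StructuredException =
  new StructuredException(
    "COLLECTION_SIZE_LIMIT_EXCEEDED.INITIALIZE",
    Map(
      "numberOfElements" -> (n1.toLong * n2.toLong).toString,
      "maxRoundedArrayLength" -> Int.MaxValue.toString))
```

Reusing an existing error class like `COLLECTION_SIZE_LIMIT_EXCEEDED`, as suggested, keeps the error catalog small and the message format consistent.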
Please update the PR description @wayneguow
Suggested title change: the `cartesian` operation of RDDs → `CartesianRDD.getPartitions`
Both title & description updated~
Thank you all. @JoshRosen @LuciferYang @yaooqinn
What changes were proposed in this pull request?
This PR aims to optimize the error message by doing a length check and throwing the `COLLECTION_SIZE_LIMIT_EXCEEDED.INITIALIZE` error in `CartesianRDD.getPartitions`.

Why are the changes needed?
Optimize the error prompts for Spark users.
Does this PR introduce any user-facing change?
Yes, this PR changes user-facing error class and message.
How was this patch tested?
Added a new test case in `RDDSuite`.

Was this patch authored or co-authored using generative AI tooling?
No.