
[SPARK-48656][CORE] Do a length check and throw COLLECTION_SIZE_LIMIT_EXCEEDED error in CartesianRDD.getPartitions #47019

Closed · wants to merge 7 commits

Conversation

@wayneguow (Contributor) commented Jun 19, 2024

What changes were proposed in this pull request?

This PR aims to improve the error message by doing a length check and throwing a COLLECTION_SIZE_LIMIT_EXCEEDED.INITIALIZE error in CartesianRDD.getPartitions.

Why are the changes needed?

It gives Spark users a clearer error message when the number of partitions of a cartesian product overflows Int.MaxValue.

Does this PR introduce any user-facing change?

Yes, this PR changes a user-facing error class and message.

How was this patch tested?

Add a new test case in RDDSuite (a sketch follows at the end of this description).

Was this patch authored or co-authored using generative AI tooling?

No.
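
A sketch of what the new RDDSuite test might look like, assuming the check raises SparkIllegalArgumentException with the COLLECTION_SIZE_LIMIT_EXCEEDED.INITIALIZE error class; the test name and exact assertions here are illustrative, not necessarily the merged test:

```scala
import org.apache.spark.SparkIllegalArgumentException

test("SPARK-48656: partition number of a cartesian product must not overflow Int") {
  // 65536 * 65536 = 2^32 exceeds Int.MaxValue (2^31 - 1). The parent
  // partitions are nearly all empty, so building the two RDDs stays cheap.
  val rdd1 = sc.parallelize(Seq(1), numSlices = 65536)
  val rdd2 = sc.parallelize(Seq(2), numSlices = 65536)
  val e = intercept[SparkIllegalArgumentException] {
    // Accessing .partitions invokes getPartitions, which should fail fast
    // with the new error instead of attempting a huge array allocation.
    rdd1.cartesian(rdd2).partitions
  }
  assert(e.getErrorClass === "COLLECTION_SIZE_LIMIT_EXCEEDED.INITIALIZE")
}
```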

@github-actions github-actions bot added the CORE label Jun 19, 2024
@wayneguow wayneguow closed this Jun 19, 2024
@LuciferYang LuciferYang reopened this Jun 19, 2024
@JoshRosen (Contributor) left a comment

@timlee0119 also opened a PR for this at #47016. Compared to that other PR, this PR has unit tests and also uses the new structured error framework / error class, so I lean towards taking this PR's approach.

@@ -53,11 +54,16 @@ class CartesianRDD[T: ClassTag, U: ClassTag](
extends RDD[(T, U)](sc, Nil)
with Serializable {

val numPartitionsInRdd1 = rdd1.partitions.length
@JoshRosen (Contributor) commented on the diff:

This change means that rdd1.partitions may be computed more eagerly than before, which is potentially an internal behavior change. In principle this should not matter, but if we ever wanted to consider backporting this change, I would weakly lean towards keeping the existing call pattern just to minimize the scope of the changes.
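
A minimal sketch of the call pattern being suggested, keeping the length lookup inside getPartitions so rdd1.partitions is not forced at construction time. The names come from the diffs in this PR; the partition-building loop mirrors the existing CartesianRDD implementation:

```scala
override def getPartitions: Array[Partition] = {
  // Evaluated on first call to getPartitions rather than in the class body,
  // preserving the old laziness of rdd1.partitions.
  val numPartitionsInRdd1 = rdd1.partitions.length
  // Multiply as Long so the product itself cannot overflow during the check.
  val partitionNum: Long = numPartitionsInRdd1.toLong * numPartitionsInRdd2.toLong
  if (partitionNum > Int.MaxValue) {
    throw SparkCoreErrors.cartesianPartitionNumOverflow(
      numPartitionsInRdd1, numPartitionsInRdd2)
  }
  // Create the cross-product splits, as in the existing implementation.
  val array = new Array[Partition](numPartitionsInRdd1 * numPartitionsInRdd2)
  for (s1 <- rdd1.partitions; s2 <- rdd2.partitions) {
    val idx = s1.index * numPartitionsInRdd2 + s2.index
    array(idx) = new CartesianPartition(idx, rdd1, rdd2, s1.index, s2.index)
  }
  array
}
```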

@wayneguow (Contributor, Author) replied:

@JoshRosen Thank you for your thoughtful advice; I will fix it.

@JoshRosen (Contributor) left a comment

LGTM! I'll wait a day or so to see if anyone else has feedback, otherwise I'll loop back and merge this tomorrow. Thanks for working on this.

val numPartitionsInRdd1 = rdd1.partitions.length
val partitionNum: Long = numPartitionsInRdd1.toLong * numPartitionsInRdd2.toLong
if (partitionNum > Int.MaxValue) {
throw SparkCoreErrors.cartesianPartitionNumOverflow(numPartitionsInRdd1, numPartitionsInRdd2)
@yaooqinn (Member) commented on the diff:

I prefer to reuse COLLECTION_SIZE_LIMIT_EXCEEDED

@wayneguow (Contributor, Author) replied:

@yaooqinn Thank you for your suggestion; let me change it.

@yaooqinn (Member) left a comment

See org.apache.spark.sql.errors.QueryExecutionErrors.tooManyArrayElementsError

@wayneguow (Contributor, Author) commented Jun 20, 2024

> See org.apache.spark.sql.errors.QueryExecutionErrors.tooManyArrayElementsError

@yaooqinn This method is defined in spark-catalyst, but spark-catalyst depends on spark-core. How about I reuse this error class, but still define an error method in SparkCoreErrors?
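
A minimal sketch of that approach, assuming a helper in org.apache.spark.errors.SparkCoreErrors that mirrors QueryExecutionErrors.tooManyArrayElementsError and reuses the existing error class; the method name and message parameters are illustrative rather than a guaranteed match for the merged code:

```scala
import org.apache.spark.SparkIllegalArgumentException

def tooManyArrayElementsError(
    numElements: Long,
    maxNumElements: Int): SparkIllegalArgumentException = {
  // Reuses COLLECTION_SIZE_LIMIT_EXCEEDED.INITIALIZE from the shared
  // error-class definitions instead of adding a new core-only error class.
  new SparkIllegalArgumentException(
    errorClass = "COLLECTION_SIZE_LIMIT_EXCEEDED.INITIALIZE",
    messageParameters = Map(
      "numberOfElements" -> numElements.toString,
      "maxRoundedArrayLength" -> maxNumElements.toString))
}
```

This keeps spark-core free of a dependency on spark-catalyst while producing the same error class and message shape as the SQL-side helper.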

@LuciferYang (Contributor) left a comment

+1, LGTM

@yaooqinn (Member) commented:

Please update the PR description @wayneguow

@wayneguow wayneguow changed the title [SPARK-48656][CORE] Optimize the error message that the number of partitions overflow during the cartesian operation of RDDs [SPARK-48656][CORE] Do a length check and throw COLLECTION_SIZE_LIMIT_EXCEEDED error in CartesianRDD.getPartitions Jun 21, 2024
@wayneguow (Contributor, Author) commented Jun 21, 2024

> Please update the PR description @wayneguow

Both title & description updated~

@yaooqinn yaooqinn closed this in 3469ec6 Jun 21, 2024
@wayneguow (Contributor, Author) commented:
Thank you all. @JoshRosen @LuciferYang @yaooqinn
