
[SPARK-55887][CONNECT] Special handling for CollectLimitExec/CollectTailExec to avoid full table scans #54685

Closed
LuciferYang wants to merge 6 commits into apache:master from LuciferYang:connect-collect-limit

Conversation

@LuciferYang
Contributor

@LuciferYang LuciferYang commented Mar 9, 2026

What changes were proposed in this pull request?

This PR updates SparkConnectPlanExecution to use executeCollect() instead of execute() when processing CollectLimitExec and CollectTailExec physical plans.

In Spark Connect, operations like head(), take(), and tail() are translated into CollectLimitExec or CollectTailExec physical nodes. Previously, these were executed via the standard execute() path, which often resulted in scanning all partitions before reducing the results.

By switching to executeCollect(), Spark Connect now leverages the optimized executeTake() and executeTail() implementations already present in Spark Classic. These optimizations ensure that only the necessary partitions are scanned (e.g., scanning only the first partition for head(1)), significantly reducing I/O and task overhead.
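The shape of the change can be sketched with stub types. The names below (SparkPlanStub, choosePath, and friends) are illustrative placeholders for this writeup, not the actual Spark internals:

```scala
// Illustrative stubs only -- these are NOT the real Spark classes.
sealed trait SparkPlanStub
case class CollectLimitStub(limit: Int) extends SparkPlanStub
case class CollectTailStub(limit: Int) extends SparkPlanStub
case object OtherPlanStub extends SparkPlanStub

object DispatchSketch {
  // Before this PR, every plan went through the generic execute() path;
  // after it, limit/tail plans short-circuit to the optimized collect path.
  def choosePath(plan: SparkPlanStub): String = plan match {
    case _: CollectLimitStub => "executeCollect" // backed by executeTake()
    case _: CollectTailStub  => "executeCollect" // backed by executeTail()
    case _                   => "execute"        // generic RDD pipeline
  }
}
```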

Why are the changes needed?

Parity with Spark Classic behavior and performance optimization.

In Spark Classic, Dataset.collect() (and by extension head/take/tail) uses plan.executeCollect(). This path includes optimizations to avoid full table scans:

  • CollectLimitExec uses executeTake(): It starts by scanning only the first partition and incrementally scans more only if the limit isn't met.
  • CollectTailExec uses executeTail(): It starts scanning from the last partition backwards.
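The incremental behavior of executeTake() can be modeled with a small self-contained simulation. This is a simplified sketch, not Spark's actual code; among other details, the real implementation grows the number of partitions scanned per round (governed by spark.sql.limit.scaleUpFactor) rather than scanning strictly one at a time:

```scala
object TakeSim {
  // Simplified model of executeTake(): scan partitions from the front,
  // stopping as soon as `n` rows have been gathered.
  // Returns (rows, number of partitions scanned).
  def take(partitions: Seq[Seq[Int]], n: Int): (Seq[Int], Int) = {
    var rows = Vector.empty[Int]
    var scanned = 0
    val it = partitions.iterator
    while (rows.length < n && it.hasNext) {
      rows ++= it.next()
      scanned += 1
    }
    (rows.take(n), scanned)
  }
}
```

With 100 partitions of 100 rows each, asking for 1 row touches only the first partition; asking for 150 rows touches exactly two.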

In Spark Connect (before this PR), SparkConnectPlanExecution used plan.execute(). For a limit(1) query on a 100-partition table, this launched 100 tasks (one per partition's LocalLimit), causing unnecessary computation and resource usage.

Example Scenario:
Running spark.range(0, 10000, 1, 100).limit(1).collect():

  • Classic: Launches 1 task (scans partition 0).
  • Connect (Before): Launches 100 tasks (scans partitions 0-99).
  • Connect (After): Launches 1 task (scans partition 0).
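The tail(1) case follows the same idea in reverse. A simplified model of executeTail()'s backward scan (again a sketch under the same assumptions, not Spark's actual implementation):

```scala
object TailSim {
  // Simplified model of executeTail(): scan partitions from the back,
  // stopping once `n` rows are available, then keep only the last `n`.
  // Returns (rows, number of partitions scanned).
  def tail(partitions: Seq[Seq[Int]], n: Int): (Seq[Int], Int) = {
    var rows = Vector.empty[Int]
    var scanned = 0
    val it = partitions.reverseIterator
    while (rows.length < n && it.hasNext) {
      rows = it.next().toVector ++ rows // prepend to preserve row order
      scanned += 1
    }
    (rows.takeRight(n), scanned)
  }
}
```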

Does this PR introduce any user-facing change?

No.

How was this patch tested?

Added new test cases in SparkConnectServiceSuite to verify the task count reduction:

  • test("SPARK-55887: Use executeCollect for limit to avoid full scan"): Verified that limit(1) on a 100-partition DataFrame triggers significantly fewer tasks than partitions (expected: 1 task).
  • test("SPARK-55887: Use executeCollect for tail to avoid full scan"): Verified that tail(1) on a 100-partition DataFrame triggers significantly fewer tasks than partitions (expected: 1 task).

Was this patch authored or co-authored using generative AI tooling?

Test cases were generated with the assistance of Gemini 3.

@LuciferYang LuciferYang marked this pull request as draft March 9, 2026 08:50
@LuciferYang
Contributor Author

test first

@LuciferYang LuciferYang changed the title Special handling of CollectLimitExec/CollectTailExec [SPARK-55887][CONNECT] Special handling for CollectLimitExec/CollectTailExec to avoid full table scans Mar 9, 2026
@LuciferYang LuciferYang marked this pull request as ready for review March 9, 2026 09:52
sendBatch(bytes, count, offset)
offset += count
}
case collectLimit: CollectLimitExec =>
Contributor Author

This fix might bring additional memory pressure to the Connect server. However, I think we can implement a simple fix first and then look for a better solution later.

@LuciferYang LuciferYang marked this pull request as draft March 9, 2026 13:37
@LuciferYang
Contributor Author

Let me test it in the production environment.

@LuciferYang LuciferYang marked this pull request as ready for review March 10, 2026 02:37
@LuciferYang
Contributor Author

Let me test it in the production environment.

The test indicates that the function is working ok.

@LuciferYang
Contributor Author

Thank you @dongjoon-hyun

offset += count
}
case collectLimit: CollectLimitExec =>
SQLExecution.withNewExecutionId(dataframe.queryExecution, Some("collectArrow")) {
Contributor

nit: shall we use names like collectLimitArrow/collectTailArrow?

Contributor Author

done

@dongjoon-hyun
Member

Merged to master for Apache Spark 4.2.0. Thank you, @LuciferYang and all.

@LuciferYang
Contributor Author

Thanks @dongjoon-hyun @zhengruifeng @yikf

@LuciferYang LuciferYang deleted the connect-collect-limit branch March 17, 2026 08:20