[GLUTEN-8912][VL] Add Offset support for CollectLimitExec #8914

Merged
jinchengchenghh merged 11 commits into apache:main from ArnavBalyan:arnavb/collect-limit-offset
Apr 18, 2025

Conversation

@ArnavBalyan
Member

What changes were proposed in this pull request?

  • Add offset support to the CollectLimitExec operator.
  • Also makes it compatible with newer Spark versions (3.4 and 3.5).

How was this patch tested?

  • Unit Tests added.
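For context, the result semantics can be modeled in a few lines. This is a hedged Python sketch of the behavior (the actual operator is implemented in Scala on top of Velox); it mirrors the `adjustedLimit = Math.max(0, limit - offset)` computation that appears later in this PR:

```python
def collect_limit_with_offset(rows, limit, offset):
    # Skip the first `offset` rows, then take up to `limit - offset`
    # rows -- an illustrative model, not the Gluten implementation.
    adjusted_limit = max(0, limit - offset)
    return rows[offset:offset + adjusted_limit]
```

For example, `collect_limit_with_offset(list(range(10)), 5, 2)` returns `[2, 3, 4]`.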

@github-actions github-actions bot added CORE works for Gluten Core VELOX CLICKHOUSE labels Mar 5, 2025
@github-actions

github-actions bot commented Mar 5, 2025

#8912

@github-actions

github-actions bot commented Mar 5, 2025

Run Gluten Clickhouse CI on x86

@@ -58,7 +58,7 @@ class GlutenSQLCollectLimitExecSuite extends WholeStageTransformerSuite {

testWithSpecifiedSparkVersion(
Contributor

testWithSpecifiedSparkVersion -> test

Member Author

done thanks

}

testWithSpecifiedSparkVersion("ColumnarCollectLimitExec - with filter", Array("3.2", "3.3")) {
testWithSpecifiedSparkVersion(
Contributor

ditto, so as others

Member Author

updated

assertGlutenOperatorMatch[ColumnarCollectLimitBaseExec](unionDf, checkMatch = true)
}

testWithSpecifiedSparkVersion("ColumnarCollectLimitExec - offset test", Array("3.4", "3.5")) {
Contributor

What's the result for spark3.3? Is the result also correct but operator not matched?

Contributor

If that, please also add the result check for spark3.2 and spark3.3

Contributor

Please add the test to cover more code path, such as limit(12)

Member Author

For 3.3 it would fail at compile time, since the offset API is not available on CollectLimitExec in older versions.

Member Author

Added more tests to cover the above scenario; the Spark UTs should also help.

partition => {
val droppedRows = dropLimitedRows(partition, offset)
val adjustedLimit = Math.max(0, limit - offset)
collectLimitedRows(droppedRows, adjustedLimit)
Contributor

Can we enhance collectLimitedRows to slice the input RowVector from offset to adjustedLimit?

Member Author

Yes, however that would not preserve order. The current implementation closely matches Spark; users could otherwise see unexpected ordering and failures across UTs. Keeping it this way maintains the same ordering as Spark. Thanks!
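The drop-then-collect flow under discussion can be sketched in Python (an illustrative model of the Scala logic in the diff, not the actual implementation; the function names mirror the diff):

```python
def drop_limited_rows(partition, offset):
    """Skip the first `offset` rows of a partition iterator."""
    it = iter(partition)
    for _ in range(offset):
        next(it, None)
    return it

def collect_limited_rows(partition, limit):
    """Take at most `limit` rows, preserving the partition's order."""
    out = []
    for row in partition:
        if len(out) >= limit:
            break
        out.append(row)
    return out

# Combined, mirroring the mapPartitions body in the diff:
def limited_partition(partition, limit, offset):
    dropped = drop_limited_rows(partition, offset)
    adjusted_limit = max(0, limit - offset)
    return collect_limited_rows(dropped, adjusted_limit)
```

Because rows are consumed strictly in iterator order, the output ordering matches Spark's row-at-a-time semantics, which is the ordering concern discussed here.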

@github-actions

Run Gluten Clickhouse CI on x86

@ArnavBalyan ArnavBalyan force-pushed the arnavb/collect-limit-offset branch from 8ab2f0b to d0fd18c Compare March 16, 2025 17:46
@github-actions

Run Gluten Clickhouse CI on x86

@ArnavBalyan ArnavBalyan force-pushed the arnavb/collect-limit-offset branch from d0fd18c to 3a84eb9 Compare March 18, 2025 14:00
@github-actions

Run Gluten Clickhouse CI on x86


@ArnavBalyan ArnavBalyan force-pushed the arnavb/collect-limit-offset branch from a7cc0ae to 6316a59 Compare March 19, 2025 17:52
@github-actions

Run Gluten Clickhouse CI on x86

@ArnavBalyan
Member Author

cc @jinchengchenghh, addressed the comments. Could you please take a look? Thanks!

processedRDD.mapPartitions(partition => collectLimitedRows(partition, limit))
processedRDD.mapPartitions(
partition => {
val droppedRows = dropLimitedRows(partition, offset)
Contributor

We can add an offset argument to collectLimitedRows and just handle it in fetchNext; that would make the function much simpler, right?

Member Author

Updated

@github-actions

Run Gluten Clickhouse CI on x86

val leftoverAfterSkip = batchSize - startIndex
rowsToSkip = 0

val needed = math.min(rowsToCollect, leftoverAfterSkip)
Contributor

If needed <= remaining, we still need this logic; we may return the whole batch instead of a sliced batch.

if (currentBatchRowCount <= remaining) {
  rowsCollected += currentBatchRowCount
  ColumnarBatches.retain(currentBatch)
  nextBatch = Some(currentBatch)
} else {
  val prunedBatch = VeloxColumnarBatches.slice(currentBatch, 0, remaining)

Member Author

In that case, startIndex would be 0 and leftoverAfterSkip = batchSize, leading to val prunedBatch = VeloxColumnarBatches.slice(batch, 0, batchSize).
Could you give an example of batch size with limit and offset for the above case?

Member Author

@jinchengchenghh does this address the comment?

Contributor

So we don't need to slice in that case; the sliced batch would be the whole batch.

Member Author

I see, you mean handling that case separately so we don't slice. Let me do the refactor.

Member Author

done
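The batch-level skip/slice logic settled in this thread can be sketched as follows. This is a Python model in which plain lists stand in for columnar batches; the slicing plays the role of VeloxColumnarBatches.slice, and the whole-batch pass-through reflects the refactor discussed above (the real operator would call ColumnarBatches.retain there):

```python
def fetch_limited_batches(batches, limit, offset):
    """Skip `offset` rows across batches, then collect up to
    `limit - offset` rows, passing whole batches through when no
    slice is needed and slicing otherwise."""
    rows_to_skip = offset
    rows_to_collect = max(0, limit - offset)
    out = []
    for batch in batches:
        if rows_to_collect <= 0:
            break
        batch_size = len(batch)
        if rows_to_skip >= batch_size:
            # Entire batch falls within the offset; drop it.
            rows_to_skip -= batch_size
            continue
        start_index = rows_to_skip
        rows_to_skip = 0
        needed = min(rows_to_collect, batch_size - start_index)
        if start_index == 0 and needed == batch_size:
            # Whole batch is needed: pass it through without slicing.
            out.append(batch)
        else:
            out.append(batch[start_index:start_index + needed])
        rows_to_collect -= needed
    return out
```

With three 3-row batches, limit 7, and offset 2, this yields the sliced tail of the first batch, the untouched middle batch, and a one-row slice of the last.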

@ArnavBalyan ArnavBalyan force-pushed the arnavb/collect-limit-offset branch from 1dd2ed7 to 4bde1a6 Compare March 22, 2025 07:24
@github-actions

Run Gluten Clickhouse CI on x86

@ArnavBalyan ArnavBalyan force-pushed the arnavb/collect-limit-offset branch from 4bde1a6 to 6cddf4d Compare March 24, 2025 11:16
@github-actions

Run Gluten Clickhouse CI on x86

@ArnavBalyan
Member Author

cc @jinchengchenghh, addressed all comments. Can you please take a look? Thanks!

ColumnarBatches.retain(batch)
batch
} else {
val sliced = VeloxColumnarBatches.slice(batch, startIndex, needed)
Contributor

Don't need val sliced

Member Author

Updated

@ArnavBalyan ArnavBalyan force-pushed the arnavb/collect-limit-offset branch from 6cddf4d to df35d81 Compare March 24, 2025 16:28
@ArnavBalyan ArnavBalyan force-pushed the arnavb/collect-limit-offset branch from f15d81b to c5415db Compare March 28, 2025 09:21
@github-actions

Run Gluten Clickhouse CI on x86

@zhztheplayer
Member

> cc @zhztheplayer, removing the check seems to have broken tests. I have opened this #9166, and adding the check here so that we can move forward with the PR. Please let me know what you think thanks

Do you mean you are incorporating a solution for #9166 in this PR? Would you help me locate the code? Thanks.

@ArnavBalyan
Member Author

ArnavBalyan commented Mar 28, 2025

> cc @zhztheplayer, removing the check seems to have broken tests. I have opened this #9166, and adding the check here so that we can move forward with the PR. Please let me know what you think thanks
>
> Do you mean you are incorporating a solution for #9166 in this PR? Would you help me locate the code? Thanks.

Meant allowing using child to check for columnar execution and using it in this PR. We can take up the custom rule in the future.

@ArnavBalyan ArnavBalyan force-pushed the arnavb/collect-limit-offset branch from c5415db to 326ddff Compare March 31, 2025 11:54
@github-actions

Run Gluten Clickhouse CI on x86

@zhztheplayer
Member

> removing the check seems to have broken tests.

I suggest we figure out the reason for the test failures first. We should make sure the operator outputs exactly the same data whether it's offloaded or not; otherwise it's a mismatch.

What did the broken tests look like?

@ArnavBalyan
Member Author

> removing the check seems to have broken tests.
>
> I suggest we figure out the reason for the test failures first. We should make sure the operator outputs exactly the same data whether it's offloaded or not; otherwise it's a mismatch.
>
> What did the broken tests look like?

Yes, if we offload with an R2C in between collectLimit and its child, it changes the number of jobs with Gluten. The operator outputs exactly the same data either way. However, the current implementation only offloads when the child is columnar, to avoid the R2C overhead and the failing UTs. Thanks!

@ArnavBalyan ArnavBalyan force-pushed the arnavb/collect-limit-offset branch from 326ddff to 376059c Compare April 17, 2025 08:32
@github-actions

Run Gluten Clickhouse CI on x86

@github-actions

Run Gluten Clickhouse CI on x86

@ArnavBalyan ArnavBalyan force-pushed the arnavb/collect-limit-offset branch from 01bea60 to bd56092 Compare April 17, 2025 13:10
@github-actions

Run Gluten Clickhouse CI on x86

@ArnavBalyan ArnavBalyan force-pushed the arnavb/collect-limit-offset branch from bd56092 to 0f5b111 Compare April 18, 2025 07:52
@github-actions

Run Gluten Clickhouse CI on x86


@ArnavBalyan ArnavBalyan force-pushed the arnavb/collect-limit-offset branch from ae1306c to b67ccc7 Compare April 18, 2025 12:57
@github-actions

Run Gluten Clickhouse CI on x86

@ArnavBalyan ArnavBalyan force-pushed the arnavb/collect-limit-offset branch from b67ccc7 to 36a3300 Compare April 18, 2025 13:41
@github-actions

Run Gluten Clickhouse CI on x86

@ArnavBalyan
Member Author

This should be fixed with the post-transform rule. Could you please take a look and help re-run the UTs? Thanks! @jinchengchenghh @zhztheplayer

import org.apache.spark.SparkConf
import org.apache.spark.sql.{DataFrame, GlutenSQLTestsTrait, Row}

class GlutenSQLCollectLimitExecSuite extends GlutenSQLTestsTrait {
Member

This is a big test file. Is it enough for us to only add this for the newest Spark version (3.5)? Further maintenance would be easier then.

Member Author

It seems we may need 3.3, since older versions do not support the offset API; the tests differ slightly depending on the offset support, which was added in 3.4. Thanks!

Member

@zhztheplayer left a comment

Thank you for iterating.

@jinchengchenghh jinchengchenghh merged commit 4e5125c into apache:main Apr 18, 2025
47 checks passed

Labels

CLICKHOUSE CORE works for Gluten Core VELOX

4 participants