[GLUTEN-8912][VL] Add Offset support for CollectLimitExec #8914

Merged
jinchengchenghh merged 11 commits into apache:main from ArnavBalyan:arnavb/collect-limit-offset
Apr 18, 2025

Conversation

@ArnavBalyan
Member

What changes were proposed in this pull request?

  • Add offset support to the CollectLimitExec operator.
  • Also makes it compatible with newer Spark versions (3.4 and 3.5).

How was this patch tested?

  • Unit Tests added.
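For context, the result semantics can be modeled in a few lines. This is a hedged Python sketch of the behavior (the actual operator is implemented in Scala on top of Velox); it mirrors the `adjustedLimit = Math.max(0, limit - offset)` computation that appears later in this PR:

```python
def collect_limit_with_offset(rows, limit, offset):
    # Skip the first `offset` rows, then take up to `limit - offset`
    # rows -- an illustrative model, not the Gluten implementation.
    adjusted_limit = max(0, limit - offset)
    return rows[offset:offset + adjusted_limit]
```

For example, `collect_limit_with_offset(list(range(10)), 5, 2)` returns `[2, 3, 4]`.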

@github-actions github-actions bot added CORE works for Gluten Core VELOX CLICKHOUSE labels Mar 5, 2025
@github-actions

github-actions bot commented Mar 5, 2025

#8912

@github-actions

github-actions bot commented Mar 5, 2025

Run Gluten Clickhouse CI on x86

@@ -58,7 +58,7 @@ class GlutenSQLCollectLimitExecSuite extends WholeStageTransformerSuite {

testWithSpecifiedSparkVersion(
Contributor

testWithSpecifiedSparkVersion -> test

Member Author

done thanks

}

testWithSpecifiedSparkVersion("ColumnarCollectLimitExec - with filter", Array("3.2", "3.3")) {
testWithSpecifiedSparkVersion(
Contributor

ditto, so as others

Member Author

updated

assertGlutenOperatorMatch[ColumnarCollectLimitBaseExec](unionDf, checkMatch = true)
}

testWithSpecifiedSparkVersion("ColumnarCollectLimitExec - offset test", Array("3.4", "3.5")) {
Contributor

What's the result for spark3.3? Is the result also correct but operator not matched?

Contributor

If that, please also add the result check for spark3.2 and spark3.3

Contributor

Please add the test to cover more code path, such as limit(12)

Member Author

For 3.3 it would fail at compile time, since the offset API is not available on CollectLimitExec in older versions.

Member Author

Added more tests to cover the above scenario; the Spark UTs should also help.

partition => {
val droppedRows = dropLimitedRows(partition, offset)
val adjustedLimit = Math.max(0, limit - offset)
collectLimitedRows(droppedRows, adjustedLimit)
Contributor

Can we enhance collectLimitedRows to slice the input RowVector from offset to adjustedLimit?

Member Author

Yes, however that would not preserve order. The current implementation closely matches Spark; users could otherwise see unexpected ordering and failures across UTs. Keeping it this way maintains the same ordering as Spark. Thanks!
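The drop-then-collect flow under discussion can be sketched in Python (an illustrative model of the Scala logic in the diff, not the actual implementation; the function names mirror the diff):

```python
def drop_limited_rows(partition, offset):
    """Skip the first `offset` rows of a partition iterator."""
    it = iter(partition)
    for _ in range(offset):
        next(it, None)
    return it

def collect_limited_rows(partition, limit):
    """Take at most `limit` rows, preserving the partition's order."""
    out = []
    for row in partition:
        if len(out) >= limit:
            break
        out.append(row)
    return out

# Combined, mirroring the mapPartitions body in the diff:
def limited_partition(partition, limit, offset):
    dropped = drop_limited_rows(partition, offset)
    adjusted_limit = max(0, limit - offset)
    return collect_limited_rows(dropped, adjusted_limit)
```

Because rows are consumed strictly in iterator order, the output ordering matches Spark's row-at-a-time semantics, which is the ordering concern discussed here.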

@github-actions

Run Gluten Clickhouse CI on x86

@ArnavBalyan ArnavBalyan force-pushed the arnavb/collect-limit-offset branch from 8ab2f0b to d0fd18c Compare March 16, 2025 17:46
@github-actions

Run Gluten Clickhouse CI on x86

@ArnavBalyan ArnavBalyan force-pushed the arnavb/collect-limit-offset branch from d0fd18c to 3a84eb9 Compare March 18, 2025 14:00
@github-actions

Run Gluten Clickhouse CI on x86


@ArnavBalyan ArnavBalyan force-pushed the arnavb/collect-limit-offset branch from a7cc0ae to 6316a59 Compare March 19, 2025 17:52
@github-actions

Run Gluten Clickhouse CI on x86

@ArnavBalyan
Member Author

cc @jinchengchenghh, addressed the comments. Could you please take a look? Thanks!

processedRDD.mapPartitions(partition => collectLimitedRows(partition, limit))
processedRDD.mapPartitions(
partition => {
val droppedRows = dropLimitedRows(partition, offset)
Contributor

We can add an offset argument to collectLimitedRows and just handle it in fetchNext; that would make the function much simpler, right?

Member Author

Updated

@github-actions

Run Gluten Clickhouse CI on x86

val leftoverAfterSkip = batchSize - startIndex
rowsToSkip = 0

val needed = math.min(rowsToCollect, leftoverAfterSkip)
Contributor

If needed <= remaining, we still need this logic; we may return the whole batch instead of a sliced batch.

if (currentBatchRowCount <= remaining) {
  rowsCollected += currentBatchRowCount
  ColumnarBatches.retain(currentBatch)
  nextBatch = Some(currentBatch)
} else {
  val prunedBatch = VeloxColumnarBatches.slice(currentBatch, 0, remaining)

Member Author

In that case, startIndex would be 0 and leftoverAfterSkip = batchSize, leading to val prunedBatch = VeloxColumnarBatches.slice(batch, 0, batchSize).
Could you give an example of batch size with limit and offset for the above case?

Member Author

@jinchengchenghh does this address the comment?

Contributor

So we don't need to slice in that case; the sliced batch would be the whole batch.

Member Author

I see, you mean handling that case separately so we don't slice. Let me do the refactor.

Member Author

done
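The batch-level skip/slice logic settled in this thread can be sketched as follows. This is a Python model in which plain lists stand in for columnar batches; the slicing plays the role of VeloxColumnarBatches.slice, and the whole-batch pass-through reflects the refactor discussed above (the real operator would call ColumnarBatches.retain there):

```python
def fetch_limited_batches(batches, limit, offset):
    """Skip `offset` rows across batches, then collect up to
    `limit - offset` rows, passing whole batches through when no
    slice is needed and slicing otherwise."""
    rows_to_skip = offset
    rows_to_collect = max(0, limit - offset)
    out = []
    for batch in batches:
        if rows_to_collect <= 0:
            break
        batch_size = len(batch)
        if rows_to_skip >= batch_size:
            # Entire batch falls within the offset; drop it.
            rows_to_skip -= batch_size
            continue
        start_index = rows_to_skip
        rows_to_skip = 0
        needed = min(rows_to_collect, batch_size - start_index)
        if start_index == 0 and needed == batch_size:
            # Whole batch is needed: pass it through without slicing.
            out.append(batch)
        else:
            out.append(batch[start_index:start_index + needed])
        rows_to_collect -= needed
    return out
```

With three 3-row batches, limit 7, and offset 2, this yields the sliced tail of the first batch, the untouched middle batch, and a one-row slice of the last.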

@ArnavBalyan ArnavBalyan force-pushed the arnavb/collect-limit-offset branch from 1dd2ed7 to 4bde1a6 Compare March 22, 2025 07:24
@github-actions

Run Gluten Clickhouse CI on x86

@ArnavBalyan ArnavBalyan force-pushed the arnavb/collect-limit-offset branch from 4bde1a6 to 6cddf4d Compare March 24, 2025 11:16
@github-actions

Run Gluten Clickhouse CI on x86

@ArnavBalyan
Member Author

cc @jinchengchenghh, addressed all comments. Can you please take a look? Thanks!

ColumnarBatches.retain(batch)
batch
} else {
val sliced = VeloxColumnarBatches.slice(batch, startIndex, needed)
Contributor

Don't need val sliced

Member Author

Updated

@ArnavBalyan ArnavBalyan force-pushed the arnavb/collect-limit-offset branch from 6cddf4d to df35d81 Compare March 24, 2025 16:28
@ArnavBalyan ArnavBalyan force-pushed the arnavb/collect-limit-offset branch from f15d81b to c5415db Compare March 28, 2025 09:21
@github-actions

Run Gluten Clickhouse CI on x86

@zhztheplayer
Member

> cc @zhztheplayer, removing the check seems to have broken tests. I have opened this #9166, and adding the check here so that we can move forward with the PR. Please let me know what you think thanks

Do you mean you are incorporating a solution for #9166 in this PR? Would you help me locate the code? Thanks.

@ArnavBalyan
Member Author

ArnavBalyan commented Mar 28, 2025

> cc @zhztheplayer, removing the check seems to have broken tests. I have opened this #9166, and adding the check here so that we can move forward with the PR. Please let me know what you think thanks
>
> Do you mean you are incorporating a solution for #9166 in this PR? Would you help me locate the code? Thanks.

Meant allowing using child to check for columnar execution and using it in this PR. We can take up the custom rule in the future.

@ArnavBalyan ArnavBalyan force-pushed the arnavb/collect-limit-offset branch from c5415db to 326ddff Compare March 31, 2025 11:54
@github-actions

Run Gluten Clickhouse CI on x86

@zhztheplayer
Member

> removing the check seems to have broken tests.

I suggest we figure out the reason for the test failures first. We should make sure the operator outputs exactly the same data whether it's offloaded or not; otherwise it's a mismatch.

What did the broken tests look like?

@ArnavBalyan
Member Author

> removing the check seems to have broken tests.
>
> I suggest we figure out the reason for the test failures first. We should make sure the operator outputs exactly the same data whether it's offloaded or not; otherwise it's a mismatch.
>
> What did the broken tests look like?

Yes, if we offload with an R2C in between collectLimit and its child, it changes the number of jobs with Gluten. The operator outputs exactly the same data either way. However, the current implementation only offloads when the child is columnar, to avoid the R2C overhead and the failing UTs. Thanks!

@ArnavBalyan ArnavBalyan force-pushed the arnavb/collect-limit-offset branch from 326ddff to 376059c Compare April 17, 2025 08:32
@github-actions

Run Gluten Clickhouse CI on x86

@github-actions

Run Gluten Clickhouse CI on x86

@ArnavBalyan ArnavBalyan force-pushed the arnavb/collect-limit-offset branch from 01bea60 to bd56092 Compare April 17, 2025 13:10
@github-actions

Run Gluten Clickhouse CI on x86

@ArnavBalyan ArnavBalyan force-pushed the arnavb/collect-limit-offset branch from bd56092 to 0f5b111 Compare April 18, 2025 07:52
@github-actions

Run Gluten Clickhouse CI on x86


@ArnavBalyan ArnavBalyan force-pushed the arnavb/collect-limit-offset branch from ae1306c to b67ccc7 Compare April 18, 2025 12:57
@github-actions

Run Gluten Clickhouse CI on x86

@ArnavBalyan ArnavBalyan force-pushed the arnavb/collect-limit-offset branch from b67ccc7 to 36a3300 Compare April 18, 2025 13:41
@github-actions

Run Gluten Clickhouse CI on x86

@ArnavBalyan
Member Author

This should be fixed with the post-transform rule. Could you please take a look and help re-run the UTs? Thanks! @jinchengchenghh @zhztheplayer

import org.apache.spark.SparkConf
import org.apache.spark.sql.{DataFrame, GlutenSQLTestsTrait, Row}

class GlutenSQLCollectLimitExecSuite extends GlutenSQLTestsTrait {
Member

This is a big test file. Is it enough for us to only add this for the newest Spark version (3.5)? Further maintenance would be easier then.

Member Author

It seems we may need 3.3, since older versions do not support the offset API; the tests differ slightly depending on the offset support, which was added in 3.4. Thanks!

Member

@zhztheplayer left a comment

Thank you for iterating.

@jinchengchenghh jinchengchenghh merged commit 4e5125c into apache:main Apr 18, 2025
47 checks passed

Labels

CLICKHOUSE CORE works for Gluten Core VELOX

4 participants