[CORE][CH] Support MicroBatchScanExec with KafkaScan in batch mode by loneylee · Pull Request #8321 · apache/gluten

loneylee · 2024-12-24T07:17:40Z

What changes were proposed in this pull request?

Support Spark Structured Streaming as follows:

source: kafka(batch mode - MicroBatchScanExec)
sink: file

github-actions · 2024-12-24T07:17:56Z

Thanks for opening a pull request!

Could you open an issue for this pull request on Github Issues?

https://github.com/apache/incubator-gluten/issues

Then could you also rename commit message and pull request title in the following format?

[GLUTEN-${ISSUES_ID}][COMPONENT]feat/fix: ${detailed message}

See also:

Other pull requests

github-actions · 2024-12-24T07:18:12Z

Run Gluten Clickhouse CI on x86

lwz9103 · 2024-12-24T09:01:09Z

Run Gluten Clickhouse CI on x86

lwz9103 · 2024-12-24T10:27:48Z

Run Gluten Clickhouse CI on x86

github-actions · 2024-12-24T13:26:29Z

Run Gluten Clickhouse CI on x86

github-actions · 2024-12-25T12:50:46Z

Run Gluten Clickhouse CI on x86

github-actions · 2024-12-26T03:03:13Z

Run Gluten Clickhouse CI on x86

github-actions · 2025-01-03T07:34:19Z

Run Gluten Clickhouse CI on x86

loneylee · 2025-01-03T07:45:38Z

Run Gluten Clickhouse CI on x86

github-actions · 2025-01-06T13:43:07Z

Run Gluten Clickhouse CI on x86

loneylee · 2025-01-07T02:25:16Z

@PHILO-HE @taiyang-li Please review this PR.

github-actions · 2025-01-07T07:09:13Z

Run Gluten Clickhouse CI on x86

github-actions · 2025-01-07T07:12:52Z

Run Gluten Clickhouse CI on x86

github-actions · 2025-01-07T07:51:53Z

Run Gluten Clickhouse CI on x86

github-actions · 2025-01-08T04:07:09Z

Run Gluten Clickhouse CI on x86

zhztheplayer · 2025-01-09T12:48:16Z

pom.xml

+      <id>kafka</id>
+      <activation>
+        <activeByDefault>false</activeByDefault>
+      </activation>


What's the reasoning behind making Kafka support an optional feature of Gluten? Is it because enabling it would pull in many more JAR dependencies that could cause unwanted dependency conflicts?

The optional profile avoids pulling in unnecessary JAR dependencies for Structured Streaming. Native streaming will remain an experimental feature for a long time, and keeping it optional means it will not affect current mainline development.

zhztheplayer · 2025-01-09T12:53:39Z

...k34/src/main/scala/org/apache/spark/sql/execution/datasources/v2/AbstractBatchScanExec.scala

+  @transient override lazy val inputPartitions: Seq[InputPartition] = inputPartitionsShim
+
+  @transient protected lazy val inputPartitionsShim: Seq[InputPartition] =
+    batch.planInputPartitions()


Why break the variable into two? Also, isn't the variable name inputPartitionsShim a little confusing?

MicroBatchScanExecTransformer inherits from BatchScanExecTransformerBase. In spark32, inputPartitions is named partitions. For MicroBatchScanExecTransformer, the name is currently the only difference across Spark versions, so instead of implementing the transformer in every shim, I only added a val.

Do you have any other suggestions?

@zhztheplayer

Thanks for the explanation.

github-actions · 2025-01-13T10:17:47Z

Run Gluten Clickhouse CI on x86

github-actions · 2025-01-13T12:45:49Z

Run Gluten Clickhouse CI on x86

zhztheplayer · 2025-01-14T07:54:56Z

gluten-substrait/src/main/java/org/apache/gluten/substrait/rel/ReadRelNode.java

+  public void setStreamKafka(boolean streamKafka) {
+    this.streamKafka = streamKafka;
+  }


This looks like a workaround, given that ReadRelNode was likely designed to be immutable. Could you refactor it so that streamKafka is final? Thanks.
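As an illustration of the refactor requested above, here is a minimal sketch of replacing a setter with a final field that is fixed at construction time. The class `ReadRelSketch`, its `tableName` field, and both factory methods are hypothetical stand-ins invented for this example; they are not Gluten's actual `ReadRelNode` API, and the real change in the PR may look different.

```java
// Hypothetical, simplified stand-in for ReadRelNode; NOT Gluten's actual class.
final class ReadRelSketch {
  private final String tableName;    // assumed field, for illustration only
  private final boolean streamKafka; // final: set once in the constructor, no setter

  private ReadRelSketch(String tableName, boolean streamKafka) {
    this.tableName = tableName;
    this.streamKafka = streamKafka;
  }

  // Factory for a regular (non-streaming) read.
  static ReadRelSketch forBatch(String tableName) {
    return new ReadRelSketch(tableName, false);
  }

  // Factory for a Kafka stream read; the flag is decided here and never changes.
  static ReadRelSketch forKafkaStream(String tableName) {
    return new ReadRelSketch(tableName, true);
  }

  boolean isStreamKafka() {
    return streamKafka;
  }

  String tableName() {
    return tableName;
  }
}
```

With this shape, callers choose the variant when they build the node, for example `ReadRelSketch.forKafkaStream("events")`, so no code path can flip the flag afterwards.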

zhztheplayer · 2025-01-14T07:57:07Z

gluten-substrait/src/main/resources/substrait/proto/substrait/algebra.proto

+  // Used to KafkaBatch or KafkaContinuous source
+  message StreamKafka {
+    message TopicPartition {
+      string topic = 1;
+      int32 partition = 2;
+    }
+
+    TopicPartition topic_partition = 1;
+    int64 start_offset = 2;
+    int64 end_offset = 3;
+    map<string, string> params = 4;
+    int64 poll_timeout_ms = 5;
+    bool  fail_on_data_loss = 6;
+    bool include_headers = 7;
+  }


Perhaps update https://github.com/apache/incubator-gluten/blob/main/docs/developers/SubstraitModifications.md as well?
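To make the offset fields of the `StreamKafka` message more concrete, here is a small sketch of a value object carrying the same topic, partition, and offset information. The class `KafkaOffsetRangeSketch` and its methods are hypothetical and are not generated protobuf code. It also assumes that `end_offset` is an exclusive upper bound, as in Spark's Kafka source offset ranges; the PR itself does not state this.

```java
// Hypothetical value object mirroring a few StreamKafka fields; NOT generated protobuf code.
final class KafkaOffsetRangeSketch {
  private final String topic;
  private final int partition;
  private final long startOffset; // inclusive
  private final long endOffset;   // ASSUMPTION: exclusive upper bound

  KafkaOffsetRangeSketch(String topic, int partition, long startOffset, long endOffset) {
    if (startOffset < 0 || endOffset < startOffset) {
      throw new IllegalArgumentException(
          "invalid offset range: [" + startOffset + ", " + endOffset + ")");
    }
    this.topic = topic;
    this.partition = partition;
    this.startOffset = startOffset;
    this.endOffset = endOffset;
  }

  // Upper bound on the records in this range; fewer may exist if offsets are missing
  // (for example after log compaction or retention cleanup).
  long maxRecordCount() {
    return endOffset - startOffset;
  }

  @Override
  public String toString() {
    return topic + "-" + partition + " [" + startOffset + ", " + endOffset + ")";
  }
}
```

Under these assumptions, a range built with `new KafkaOffsetRangeSketch("events", 0, 100L, 250L)` covers at most 150 records of partition 0 of topic `events`.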

zhztheplayer

The code structure looks great to me.

I am not a CH / Kafka expert so feel free to call other members for review in detail.

github-actions · 2025-01-17T02:54:49Z

Run Gluten Clickhouse CI on x86

baibaichen

LGTM

github-actions · 2025-01-17T07:43:55Z

Run Gluten Clickhouse CI on x86

loneylee · 2025-01-17T09:45:36Z

Run Gluten Clickhouse CI on x86

* [GLUTEN-1632][CH]Daily Update Clickhouse Version (20250121)
* Fix build due to ClickHouse/ClickHouse#74085
* Fix build due to ClickHouse/ClickHouse#74727
* Fix gtest build due to ClickHouse/ClickHouse#74085
* Fix Gtest failed due to #8321

---------

Co-authored-by: kyligence-git <gluten@kyligence.io>
Co-authored-by: Chang Chen <baibaichen@gmail.com>


github-actions bot added the CORE and CLICKHOUSE labels Dec 24, 2024

loneylee force-pushed the micro_scan branch from 391f2f7 to 96fd28c Compare December 24, 2024 13:25

loneylee force-pushed the micro_scan branch from 96fd28c to 3831e98 Compare December 25, 2024 12:50

loneylee force-pushed the micro_scan branch from 3831e98 to f5829d3 Compare December 26, 2024 03:02

loneylee requested a review from zzcclp January 3, 2025 04:04

loneylee force-pushed the micro_scan branch from f5829d3 to b828d2c Compare January 3, 2025 07:33

loneylee force-pushed the micro_scan branch from b828d2c to ee02250 Compare January 6, 2025 13:42

PHILO-HE changed the title ~~[CH] Support MicroBatchScanExec with KafkaScan in batch mode~~ [CORE][CH] Support MicroBatchScanExec with KafkaScan in batch mode Jan 7, 2025

zhztheplayer reviewed Jan 9, 2025

loneylee force-pushed the micro_scan branch from 185d398 to 6862086 Compare January 13, 2025 10:17

zhztheplayer reviewed Jan 14, 2025

baibaichen force-pushed the micro_scan branch from fa5c7fd to 2474b5d Compare January 17, 2025 02:54

baibaichen approved these changes Jan 17, 2025

loneylee added 15 commits January 17, 2025 14:43

[CH] Support MicroBatchScanExec with KafkaScan in batch mode

ec20b80

fix build

9d21248

fix rebase

89d7a02

fix license

9d1cb9e

rm todo

4da81ea

fix ci

afc6e07

add kafka on

dfff37f

fix ut

ed8441e

add more ut

830a658

update metrics

2dd2d47

add cmake

88dbaf5

fix pom

2e9f392

add input partition shim

bce42a9

fix version

e5de15c

Fix review

14f33c0

loneylee force-pushed the micro_scan branch from 2474b5d to 14f33c0 Compare January 17, 2025 07:43

github-actions bot added the DOCS label Jan 17, 2025

loneylee merged commit b29aa3b into apache:main Jan 20, 2025
49 checks passed

baibaichen added a commit to Kyligence/gluten that referenced this pull request Jan 21, 2025

Fix Gtest failed due to apache#8321

55ce1be

baibaichen pushed a commit to baibaichen/gluten that referenced this pull request Feb 1, 2025

[CORE][CH] Support MicroBatchScanExec with KafkaScan in batch mode (a…

3d18188

…pache#8321) * [CH] Support MicroBatchScanExec with KafkaScan in batch mode
