[CORE][CH] Support MicroBatchScanExec with KafkaScan in batch mode#8321
[CORE][CH] Support MicroBatchScanExec with KafkaScan in batch mode#8321loneylee merged 15 commits intoapache:mainfrom
Conversation
|
Thanks for opening a pull request! Could you open an issue for this pull request on Github Issues? https://github.com/apache/incubator-gluten/issues Then could you also rename commit message and pull request title in the following format? See also: |
|
Run Gluten Clickhouse CI on x86 |
2 similar comments
|
Run Gluten Clickhouse CI on x86 |
|
Run Gluten Clickhouse CI on x86 |
|
Run Gluten Clickhouse CI on x86 |
|
Run Gluten Clickhouse CI on x86 |
|
Run Gluten Clickhouse CI on x86 |
|
Run Gluten Clickhouse CI on x86 |
1 similar comment
|
Run Gluten Clickhouse CI on x86 |
|
Run Gluten Clickhouse CI on x86 |
|
@PHILO-HE @taiyang-li Please have a review of this pr. |
|
Run Gluten Clickhouse CI on x86 |
2 similar comments
|
Run Gluten Clickhouse CI on x86 |
|
Run Gluten Clickhouse CI on x86 |
|
Run Gluten Clickhouse CI on x86 |
| <id>kafka</id> | ||
| <activation> | ||
| <activeByDefault>false</activeByDefault> | ||
| </activation> |
There was a problem hiding this comment.
What's the consideration of making kafka support an optional feature of Gluten? Is it because enabling the support will introduce a bunch more Jar dependencies that may cause unwanted dependency conflicts?
There was a problem hiding this comment.
Avoid unnecessary jar dependencies for structured streaming. It will be an experimental function of native streaming for a long time. It will not affect the current mainline function development.
| @transient override lazy val inputPartitions: Seq[InputPartition] = inputPartitionsShim | ||
|
|
||
| @transient protected lazy val inputPartitionsShim: Seq[InputPartition] = | ||
| batch.planInputPartitions() |
There was a problem hiding this comment.
Why breaking the variable into two? Also, it the variable name inputPartitionsShim a little bit confusing?
There was a problem hiding this comment.
MicroBatchScanExecTransformer inherit BatchScanExecTransformerBase. In spark32, inputPartitions named partitions. For MicroBatchScanExecTransformer,
in all spark versions the only different now is the name. I don't implement transform in all shims, only add a val.
There was a problem hiding this comment.
Do you have any other suggestions?
There was a problem hiding this comment.
Thanks for the explanation.
|
Run Gluten Clickhouse CI on x86 |
1 similar comment
|
Run Gluten Clickhouse CI on x86 |
| public void setStreamKafka(boolean streamKafka) { | ||
| this.streamKafka = streamKafka; | ||
| } |
There was a problem hiding this comment.
This is like a workaround given ReadRelNode was likely designed to be immutable. Could you do some refactors to make streamKafka final? Thanks.
| // Used to KafkaBatch or KafkaContinuous source | ||
| message StreamKafka { | ||
| message TopicPartition { | ||
| string topic = 1; | ||
| int32 partition = 2; | ||
| } | ||
|
|
||
| TopicPartition topic_partition = 1; | ||
| int64 start_offset = 2; | ||
| int64 end_offset = 3; | ||
| map<string, string> params = 4; | ||
| int64 poll_timeout_ms = 5; | ||
| bool fail_on_data_loss = 6; | ||
| bool include_headers = 7; | ||
| } |
There was a problem hiding this comment.
Perhaps update https://github.com/apache/incubator-gluten/blob/main/docs/developers/SubstraitModifications.md as well?
zhztheplayer
left a comment
There was a problem hiding this comment.
The code structure looks great to me.
I am not a CH / Kafka expert so feel free to call other members for review in detail.
fa5c7fd to
2474b5d
Compare
|
Run Gluten Clickhouse CI on x86 |
|
Run Gluten Clickhouse CI on x86 |
1 similar comment
|
Run Gluten Clickhouse CI on x86 |
* [GLUTEN-1632][CH]Daily Update Clickhouse Version (20250121) * Fix build due to ClickHouse/ClickHouse#74085 * Fix build due to ClickHouse/ClickHouse#74727 * Fix gtest build due to ClickHouse/ClickHouse#74085 * Fix Gtest failed due to #8321 --------- Co-authored-by: kyligence-git <gluten@kyligence.io> Co-authored-by: Chang Chen <baibaichen@gmail.com>
…pache#8321) * [CH] Support MicroBatchScanExec with KafkaScan in batch mode
) * [GLUTEN-1632][CH]Daily Update Clickhouse Version (20250121) * Fix build due to ClickHouse/ClickHouse#74085 * Fix build due to ClickHouse/ClickHouse#74727 * Fix gtest build due to ClickHouse/ClickHouse#74085 * Fix Gtest failed due to apache#8321 --------- Co-authored-by: kyligence-git <gluten@kyligence.io> Co-authored-by: Chang Chen <baibaichen@gmail.com>
What changes were proposed in this pull request?
Support spark struct streaming as follow: