[CORE][CH] Support MicroBatchScanExec with KafkaScan in batch mode #8321

Merged
loneylee merged 15 commits into apache:main from loneylee:micro_scan
Jan 20, 2025

@loneylee
Member

What changes were proposed in this pull request?

Support Spark Structured Streaming as follows:

  • source: Kafka (batch mode, MicroBatchScanExec)
  • sink: file

github-actions bot added the CORE (works for Gluten Core) and CLICKHOUSE labels on Dec 24, 2024

@github-actions

Thanks for opening a pull request!

Could you open an issue for this pull request on Github Issues?

https://github.com/apache/incubator-gluten/issues

Then could you also rename commit message and pull request title in the following format?

[GLUTEN-${ISSUES_ID}][COMPONENT]feat/fix: ${detailed message}

See also:

@github-actions

Run Gluten Clickhouse CI on x86

@lwz9103
Member

lwz9103 commented Dec 24, 2024

Run Gluten Clickhouse CI on x86

@lwz9103
Member

lwz9103 commented Dec 24, 2024

Run Gluten Clickhouse CI on x86

@github-actions

github-actions bot commented Jan 3, 2025

Run Gluten Clickhouse CI on x86

@loneylee
Member Author

loneylee commented Jan 3, 2025

Run Gluten Clickhouse CI on x86

@github-actions

github-actions bot commented Jan 6, 2025

Run Gluten Clickhouse CI on x86

@loneylee
Member Author

loneylee commented Jan 7, 2025

@PHILO-HE @taiyang-li Please review this PR.

@github-actions

github-actions bot commented Jan 7, 2025

Run Gluten Clickhouse CI on x86

PHILO-HE changed the title from "[CH] Support MicroBatchScanExec with KafkaScan in batch mode" to "[CORE][CH] Support MicroBatchScanExec with KafkaScan in batch mode" on Jan 7, 2025

@github-actions

github-actions bot commented Jan 8, 2025

Run Gluten Clickhouse CI on x86

Comment on lines +808 to +811
<id>kafka</id>
<activation>
  <activeByDefault>false</activeByDefault>
</activation>
Member

@zhztheplayer zhztheplayer Jan 9, 2025

What is the consideration behind making Kafka support an optional feature of Gluten? Is it because enabling it would pull in many more JAR dependencies that could cause unwanted dependency conflicts?

Member Author

It avoids unnecessary JAR dependencies for Structured Streaming. Native streaming will remain an experimental feature for a long time, and keeping it optional means it will not affect current mainline feature development.

Comment on lines +60 to +63
@transient override lazy val inputPartitions: Seq[InputPartition] = inputPartitionsShim

@transient protected lazy val inputPartitionsShim: Seq[InputPartition] =
  batch.planInputPartitions()
Member

@zhztheplayer zhztheplayer Jan 9, 2025

Why break the variable into two? Also, isn't the variable name inputPartitionsShim a little confusing?

Member Author

MicroBatchScanExecTransformer inherits from BatchScanExecTransformerBase. In spark32, inputPartitions is named partitions. For MicroBatchScanExecTransformer, the name is currently the only difference across Spark versions, so rather than implementing the transformer in every shim, I only added a val.

Member Author

Do you have any other suggestions?

Member

Thanks for the explanation.
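
The shim arrangement described in this thread can be sketched as follows. This is a minimal, self-contained illustration and not the code from this PR: `InputPartitionSketch`, `BatchSketch`, `MicroBatchScanSketch`, `ScanForOlderSpark` and `ScanForNewerSpark` are hypothetical stand-ins for the Spark and Gluten types, and it assumes each Spark-version shim only has to expose the shared value under the member name that version expects.

```scala
// Hypothetical stand-ins; the real types come from Spark and Gluten.
final case class InputPartitionSketch(id: Int)

trait BatchSketch {
  def planInputPartitions(): Seq[InputPartitionSketch]
}

// Shared logic sits behind a version-neutral name and is computed once.
abstract class MicroBatchScanSketch(batch: BatchSketch) {
  protected lazy val inputPartitionsShim: Seq[InputPartitionSketch] =
    batch.planInputPartitions()
}

// Shape of a shim for a Spark version that names the member `partitions`.
class ScanForOlderSpark(batch: BatchSketch) extends MicroBatchScanSketch(batch) {
  lazy val partitions: Seq[InputPartitionSketch] = inputPartitionsShim
}

// Shape of a shim for a Spark version that names the member `inputPartitions`.
class ScanForNewerSpark(batch: BatchSketch) extends MicroBatchScanSketch(batch) {
  lazy val inputPartitions: Seq[InputPartitionSketch] = inputPartitionsShim
}
```

With this shape, the planning call appears in one place only, and each shim contributes a single forwarding `val`.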

Comment on lines +55 to +57
public void setStreamKafka(boolean streamKafka) {
  this.streamKafka = streamKafka;
}
Member

This looks like a workaround, given that ReadRelNode was likely designed to be immutable. Could you refactor it so that streamKafka is final? Thanks.
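
One way to address this comment is to pass the flag at construction time so that no setter is needed. The sketch below is written in Scala for brevity, whereas the class under review is Java. `ReadRelNodeSketch` and its `withStreamKafka` helper are hypothetical, and the other fields of the real ReadRelNode are omitted.

```scala
// Hypothetical, simplified stand-in for ReadRelNode; the real class is Java
// and carries more state than this single flag.
final class ReadRelNodeSketch(val streamKafka: Boolean) {
  // Instead of mutating in place, return a new node with the flag changed.
  def withStreamKafka(enabled: Boolean): ReadRelNodeSketch =
    new ReadRelNodeSketch(enabled)
}
```

Because `streamKafka` is a `val`, it cannot be reassigned after construction, which is the property the review asks for.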

Comment on lines +93 to +107
// Used for the KafkaBatch or KafkaContinuous source
message StreamKafka {
  message TopicPartition {
    string topic = 1;
    int32 partition = 2;
  }

  TopicPartition topic_partition = 1;
  int64 start_offset = 2;
  int64 end_offset = 3;
  map<string, string> params = 4;
  int64 poll_timeout_ms = 5;
  bool fail_on_data_loss = 6;
  bool include_headers = 7;
}
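
As a reading aid, the message above can be mirrored by a plain Scala model. This is only a sketch: `TopicPartitionSketch`, `StreamKafkaSketch` and `offsetCount` are hypothetical names that do not appear in the PR, and treating `end_offset` as exclusive is an assumption that the excerpt does not confirm.

```scala
// Hypothetical mirror of the nested TopicPartition message.
final case class TopicPartitionSketch(topic: String, partition: Int)

// Hypothetical mirror of the StreamKafka message, one field per proto field.
final case class StreamKafkaSketch(
    topicPartition: TopicPartitionSketch,
    startOffset: Long,
    endOffset: Long,
    params: Map[String, String],
    pollTimeoutMs: Long,
    failOnDataLoss: Boolean,
    includeHeaders: Boolean) {
  require(startOffset <= endOffset, "startOffset must not exceed endOffset")

  // Assumption: the range is half-open, [startOffset, endOffset).
  def offsetCount: Long = endOffset - startOffset
}
```

Each instance describes one bounded slice of a single topic partition, which is what lets a streaming source be read as a finite batch.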

Member

@zhztheplayer zhztheplayer left a comment

The code structure looks great to me.

I am not a CH / Kafka expert, so feel free to ask other members for a detailed review.

Contributor

@baibaichen baibaichen left a comment

LGTM

@loneylee loneylee merged commit b29aa3b into apache:main Jan 20, 2025
49 checks passed
baibaichen added a commit to Kyligence/gluten that referenced this pull request Jan 21, 2025
baibaichen added a commit that referenced this pull request Jan 21, 2025
* [GLUTEN-1632][CH]Daily Update Clickhouse Version (20250121)

* Fix build due to ClickHouse/ClickHouse#74085

* Fix build due to ClickHouse/ClickHouse#74727

* Fix gtest build due to ClickHouse/ClickHouse#74085

* Fix Gtest failed due to #8321

---------

Co-authored-by: kyligence-git <gluten@kyligence.io>
Co-authored-by: Chang Chen <baibaichen@gmail.com>
baibaichen pushed a commit to baibaichen/gluten that referenced this pull request Feb 1, 2025
…pache#8321)

* [CH] Support MicroBatchScanExec with KafkaScan in batch mode
baibaichen added a commit to baibaichen/gluten that referenced this pull request Feb 1, 2025
)

* [GLUTEN-1632][CH]Daily Update Clickhouse Version (20250121)

* Fix build due to ClickHouse/ClickHouse#74085

* Fix build due to ClickHouse/ClickHouse#74727

* Fix gtest build due to ClickHouse/ClickHouse#74085

* Fix Gtest failed due to apache#8321

---------

Co-authored-by: kyligence-git <gluten@kyligence.io>
Co-authored-by: Chang Chen <baibaichen@gmail.com>

Labels

CLICKHOUSE CORE works for Gluten Core DOCS
