[SPARK-48901][SPARK-48916][SS][PYTHON] Introduce clusterBy DataStreamWriter API #47376

chirag-s-db · 2024-07-16T22:33:03Z

What changes were proposed in this pull request?

Introduce a new clusterBy DataStreamWriter API in Scala, Spark Connect, and PySpark.

Why are the changes needed?

Provides another way for users to create clustered tables for streaming writes.

Does this PR introduce any user-facing change?

Yes. It adds a new clusterBy DataStreamWriter API in Scala, Spark Connect, and PySpark that lets users specify the clustering columns when writing streaming DataFrames.
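As a rough illustration of how the new API might be used from PySpark, here is a minimal sketch. The helper name `start_clustered_stream`, its parameters, and the table and column names in the usage note are hypothetical and not part of this PR. It assumes `clusterBy` takes column names as varargs and that the target table provider supports clustering.

```python
from typing import Any, Sequence


def start_clustered_stream(
    stream_df: Any,
    table_name: str,
    cluster_cols: Sequence[str],
    checkpoint_dir: str,
) -> Any:
    """Start a streaming write into a table clustered by `cluster_cols`.

    Hypothetical helper: `stream_df` is expected to be a streaming DataFrame.
    """
    # This guard is the sketch's own choice, not documented Spark behavior.
    if not cluster_cols:
        raise ValueError("at least one clustering column is required")
    return (
        stream_df.writeStream
        # New API from this PR: declare the clustering columns.
        .clusterBy(*cluster_cols)
        # Streaming writes to a table need a checkpoint location.
        .option("checkpointLocation", checkpoint_dir)
        # Starts the query, creating the table if it does not exist yet.
        .toTable(table_name)
    )
```

With a real streaming DataFrame `events_df`, a call could look like `start_clustered_stream(events_df, "events", ["user_id", "event_date"], "/tmp/ckpt")`. Whether the clustering takes effect depends on the catalog and table format in use.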

How was this patch tested?

See new unit tests.

Was this patch authored or co-authored using generative AI tooling?

No.

chirag-s-db · 2024-07-18T18:19:00Z

Note: this PR depends on #47301

zedtang

LGTM

sql/core/src/main/scala/org/apache/spark/sql/streaming/DataStreamWriter.scala

common/utils/src/main/resources/error/error-conditions.json

sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/catalog/interface.scala

sql/catalyst/src/main/scala/org/apache/spark/sql/errors/QueryCompilationErrors.scala

sql/catalyst/src/test/scala/org/apache/spark/sql/connector/catalog/InMemoryBaseTable.scala

sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/DataSourceUtils.scala

HeartSaVioR · 2024-07-24T13:59:34Z

Please allow me to review this after #47301 is merged and this PR is rebased. That seems to be just around the corner.

chirag-s-db · 2024-07-25T15:49:40Z

@HeartSaVioR #47301 has been merged, ready for review again!

HeartSaVioR

+1

HeartSaVioR · 2024-07-29T06:51:45Z

Thanks! Merging to master.

…Writer API

### What changes were proposed in this pull request?

Introduce a new clusterBy DataStreamWriter API in Scala, Spark Connect, and Pyspark.

### Why are the changes needed?

Provides another way for users to create clustered tables for streaming writes.

### Does this PR introduce _any_ user-facing change?

Yes, it adds a new clusterBy DataStreamWriter API in Scala, Spark Connect, and Pyspark to allow specifying the clustering columns when writing streaming DataFrames.

### How was this patch tested?

See new unit tests.

### Was this patch authored or co-authored using generative AI tooling?

No.

Closes apache#47376 from chirag-s-db/cluster-by-stream.

Lead-authored-by: Chirag Singh <chirag.singh@databricks.com>
Co-authored-by: Chirag Singh <137233133+chirag-s-db@users.noreply.github.com>
Signed-off-by: Jungtaek Lim <kabhwan.opensource@gmail.com>

changes

3e83ec8

github-actions bot added the SQL and STRUCTURED STREAMING labels Jul 16, 2024

chirag-s-db and others added 3 commits July 16, 2024 15:42

Merge branch 'apache:master' into cluster-by-stream

167744b

fix

1a19f3d

Merge branch 'cluster-by-stream' of https://github.com/chirag-s-db/spark into cluster-by-stream

2e5011b

github-actions bot added the BUILD label Jul 17, 2024

spark connect changes

a0bf120

github-actions bot added the CONNECT label Jul 18, 2024

chirag-s-db changed the title ~~[WIP][SPARK-48901][SQL] Introduce clusterBy DataStreamWriter API for Scala~~ [SPARK-48901][SQL] Introduce clusterBy DataStreamWriter API for Scala Jul 18, 2024

chirag-s-db marked this pull request as ready for review July 18, 2024 16:54

changes

325655c

chirag-s-db changed the title ~~[SPARK-48901][SQL] Introduce clusterBy DataStreamWriter API for Scala~~ [SPARK-48901][SQL] Introduce clusterBy DataStreamWriter API Jul 18, 2024

github-actions bot added the PYTHON label Jul 18, 2024

chirag-s-db changed the title ~~[SPARK-48901][SQL] Introduce clusterBy DataStreamWriter API~~ [SPARK-48901][SPARK-48916][SQL] Introduce clusterBy DataStreamWriter API Jul 18, 2024

chirag-s-db changed the title ~~[SPARK-48901][SPARK-48916][SQL] Introduce clusterBy DataStreamWriter API~~ [SPARK-48901][SPARK-48916][SQL][PYTHON] Introduce clusterBy DataStreamWriter API Jul 18, 2024

chirag-s-db changed the title ~~[SPARK-48901][SPARK-48916][SQL][PYTHON] Introduce clusterBy DataStreamWriter API~~ [SPARK-48901][SPARK-48916][STREAMING][PYTHON] Introduce clusterBy DataStreamWriter API Jul 18, 2024

zedtang approved these changes Jul 22, 2024

sql/core/src/main/scala/org/apache/spark/sql/streaming/DataStreamWriter.scala