[SPARK-48901][SPARK-48916][SS][PYTHON] Introduce clusterBy DataStreamWriter API #47376
Note: this PR depends on #47301
LGTM
sql/core/src/main/scala/org/apache/spark/sql/streaming/DataStreamWriter.scala
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/catalog/interface.scala
sql/catalyst/src/main/scala/org/apache/spark/sql/errors/QueryCompilationErrors.scala
sql/catalyst/src/test/scala/org/apache/spark/sql/connector/catalog/InMemoryBaseTable.scala
sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/DataSourceUtils.scala
Please allow me to review after #47301 is merged and this PR is rebased. That seems to be just around the corner.
@HeartSaVioR #47301 has been merged, ready for review again!
+1
Thanks! Merging to master.
…Writer API

### What changes were proposed in this pull request?
Introduce a new clusterBy DataStreamWriter API in Scala, Spark Connect, and PySpark.

### Why are the changes needed?
Provides another way for users to create clustered tables for streaming writes.

### Does this PR introduce _any_ user-facing change?
Yes, it adds a new clusterBy DataStreamWriter API in Scala, Spark Connect, and PySpark to allow specifying the clustering columns when writing streaming DataFrames.

### How was this patch tested?
See new unit tests.

### Was this patch authored or co-authored using generative AI tooling?
No.

Closes apache#47376 from chirag-s-db/cluster-by-stream.

Lead-authored-by: Chirag Singh <chirag.singh@databricks.com>
Co-authored-by: Chirag Singh <137233133+chirag-s-db@users.noreply.github.com>
Signed-off-by: Jungtaek Lim <kabhwan.opensource@gmail.com>
What changes were proposed in this pull request?
Introduce a new clusterBy DataStreamWriter API in Scala, Spark Connect, and PySpark.
Why are the changes needed?
Provides another way for users to create clustered tables for streaming writes.
Does this PR introduce any user-facing change?
Yes, it adds a new clusterBy DataStreamWriter API in Scala, Spark Connect, and PySpark to allow specifying the clustering columns when writing streaming DataFrames.
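
For illustration, a minimal PySpark sketch of how the new API might be used. The helper name `start_clustered_stream`, its parameters, and its empty-list guard are hypothetical and not part of this PR. The sketch assumes a Spark build that includes this change and a target table format that supports clustering.

```python
from typing import Any, Sequence


def start_clustered_stream(
    stream_df: Any,
    table_name: str,
    clustering_cols: Sequence[str],
    checkpoint_dir: str,
) -> Any:
    """Start a streaming write to `table_name`, clustered by `clustering_cols`.

    `stream_df` is expected to be a streaming DataFrame. This helper is a
    hypothetical wrapper for illustration, not code from the PR.
    """
    # Guard added by this sketch only; the PR may validate differently.
    if not clustering_cols:
        raise ValueError("at least one clustering column is required")

    writer = (
        stream_df.writeStream
        # New API from this PR: name the clustering columns for the sink table.
        .clusterBy(*clustering_cols)
        # Streaming queries need a checkpoint location to track progress.
        .option("checkpointLocation", checkpoint_dir)
    )
    # toTable starts the query and returns a handle to it.
    return writer.toTable(table_name)
```

With a real session, `stream_df` could come from something like `spark.readStream.format("rate").load()`, which produces `timestamp` and `value` columns. Whether the clustering columns take effect depends on the catalog and data source backing the table.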
How was this patch tested?
See new unit tests.
Was this patch authored or co-authored using generative AI tooling?
No.