Contributor

@chirag-s-db chirag-s-db commented Jul 16, 2024

What changes were proposed in this pull request?

Introduce a new clusterBy DataStreamWriter API in Scala, Spark Connect, and PySpark.

Why are the changes needed?

Provides another way for users to create clustered tables for streaming writes.

Does this PR introduce any user-facing change?

Yes. It adds a new clusterBy DataStreamWriter API in Scala, Spark Connect, and PySpark that lets users specify the clustering columns when writing streaming DataFrames.
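
For illustration, here is a minimal PySpark sketch of how the new method might be chained into a streaming write. The helper name `start_clustered_stream` and its parameters are hypothetical and not part of this PR; the sketch assumes a Spark build that includes this change and a target catalog that supports clustered tables.

```python
from typing import Any, Sequence


def start_clustered_stream(
    streaming_df: Any,
    table_name: str,
    checkpoint_dir: str,
    clustering_cols: Sequence[str],
) -> Any:
    """Start a streaming write into a table clustered by the given columns.

    Hypothetical helper: `streaming_df` is expected to be a streaming
    DataFrame. It is typed as Any so this sketch does not import pyspark.
    """
    if not clustering_cols:
        raise ValueError("at least one clustering column is required")
    return (
        streaming_df.writeStream
        # Streaming writes to a table need a checkpoint location.
        .option("checkpointLocation", checkpoint_dir)
        # The API added by this PR: column names passed as varargs.
        .clusterBy(*clustering_cols)
        # Starts the query and returns the StreamingQuery handle.
        .toTable(table_name)
    )
```

A call might look like `start_clustered_stream(df, "events", "/tmp/ckpt", ["country", "day"])`, where the table name, checkpoint path, and column names are placeholders.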

How was this patch tested?

See new unit tests.

Was this patch authored or co-authored using generative AI tooling?

No.

@github-actions github-actions bot added the BUILD label Jul 17, 2024
@chirag-s-db chirag-s-db changed the title [WIP][SPARK-48901][SQL] Introduce clusterBy DataStreamWriter API for Scala [SPARK-48901][SQL] Introduce clusterBy DataStreamWriter API for Scala Jul 18, 2024
@chirag-s-db chirag-s-db marked this pull request as ready for review July 18, 2024 16:54
@chirag-s-db chirag-s-db changed the title [SPARK-48901][SQL] Introduce clusterBy DataStreamWriter API for Scala [SPARK-48901][SQL] Introduce clusterBy DataStreamWriter API Jul 18, 2024
@chirag-s-db chirag-s-db changed the title [SPARK-48901][SQL] Introduce clusterBy DataStreamWriter API [SPARK-48901][SPARK-48916][SQL] Introduce clusterBy DataStreamWriter API Jul 18, 2024
@chirag-s-db chirag-s-db changed the title [SPARK-48901][SPARK-48916][SQL] Introduce clusterBy DataStreamWriter API [SPARK-48901][SPARK-48916][SQL][PYTHON] Introduce clusterBy DataStreamWriter API Jul 18, 2024
@chirag-s-db chirag-s-db changed the title [SPARK-48901][SPARK-48916][SQL][PYTHON] Introduce clusterBy DataStreamWriter API [SPARK-48901][SPARK-48916][STREAMING][PYTHON] Introduce clusterBy DataStreamWriter API Jul 18, 2024
@chirag-s-db
Contributor Author

Note: this PR depends on #47301

Contributor

@zedtang zedtang left a comment

LGTM

@HeartSaVioR HeartSaVioR changed the title [SPARK-48901][SPARK-48916][STREAMING][PYTHON] Introduce clusterBy DataStreamWriter API [SPARK-48901][SPARK-48916][S][PYTHON] Introduce clusterBy DataStreamWriter API Jul 23, 2024
@HeartSaVioR HeartSaVioR changed the title [SPARK-48901][SPARK-48916][S][PYTHON] Introduce clusterBy DataStreamWriter API [SPARK-48901][SPARK-48916][SS][PYTHON] Introduce clusterBy DataStreamWriter API Jul 23, 2024
@HeartSaVioR
Contributor

Please allow me to review this after #47301 is merged and this PR is rebased. That seems to be just around the corner.

@chirag-s-db
Contributor Author

@HeartSaVioR #47301 has been merged, ready for review again!

Contributor

@HeartSaVioR HeartSaVioR left a comment

+1

@HeartSaVioR
Contributor

Thanks! Merging to master.

ilicmarkodb pushed a commit to ilicmarkodb/spark that referenced this pull request Jul 29, 2024
…Writer API

### What changes were proposed in this pull request?
Introduce a new clusterBy DataStreamWriter API in Scala, Spark Connect, and PySpark.

### Why are the changes needed?
Provides another way for users to create clustered tables for streaming writes.

### Does this PR introduce _any_ user-facing change?
Yes, it adds a new clusterBy DataStreamWriter API in Scala, Spark Connect, and PySpark to allow specifying the clustering columns when writing streaming DataFrames.

### How was this patch tested?
See new unit tests.

### Was this patch authored or co-authored using generative AI tooling?
No.

Closes apache#47376 from chirag-s-db/cluster-by-stream.

Lead-authored-by: Chirag Singh <chirag.singh@databricks.com>
Co-authored-by: Chirag Singh <137233133+chirag-s-db@users.noreply.github.com>
Signed-off-by: Jungtaek Lim <kabhwan.opensource@gmail.com>
fusheng9399 pushed a commit to fusheng9399/spark that referenced this pull request Aug 6, 2024
attilapiros pushed a commit to attilapiros/spark that referenced this pull request Oct 4, 2024
himadripal pushed a commit to himadripal/spark that referenced this pull request Oct 19, 2024
lwz9103 pushed a commit to Kyligence/spark that referenced this pull request Mar 27, 2025
lwz9103 pushed a commit to Kyligence/spark that referenced this pull request Apr 22, 2025