
[SPARK-45908][Python] Add support for writing empty DataFrames to parquet with partitions#43798

Closed
ti1uan wants to merge 4 commits into apache:master from ti1uan:SPARK-45908

Conversation


@ti1uan ti1uan commented Nov 14, 2023

What changes were proposed in this pull request?

This change introduces new functionality in the parquet method of the DataFrameWriter class to handle writing empty DataFrames to Parquet files, particularly when partitioning is used. Previously, writing an empty DataFrame with partitions specified did not create any output in the target directory, which could break subsequent jobs expecting files with the defined schema. Now, the parquet method checks whether the DataFrame is empty and whether partitions are specified; if both conditions are true, the private method _write_empty_partition is called to handle the empty write operation.
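The proposed guard can be sketched as follows. Note that the classes below are simplified stand-ins written for illustration only, not the real PySpark API, and the body of _write_empty_partition here is a hypothetical placeholder:

```python
# Minimal sketch of the proposed guard in DataFrameWriter.parquet.
# FakeDataFrame and FakeDataFrameWriter are illustration-only stubs,
# not the actual PySpark classes.

class FakeDataFrame:
    def __init__(self, rows, schema):
        self.rows = rows
        self.schema = schema

    def isEmpty(self):
        return len(self.rows) == 0


class FakeDataFrameWriter:
    def __init__(self, df):
        self._df = df
        self.wrote_empty_placeholder = False

    def parquet(self, path, partitionBy=None):
        # Proposed behavior: when the DataFrame is empty and partition
        # columns are specified, still materialize output at `path`
        # instead of producing nothing at all.
        if self._df.isEmpty() and partitionBy:
            self._write_empty_partition(path, partitionBy)
            return
        # ... normal (non-empty) write path would go here ...

    def _write_empty_partition(self, path, partitionBy):
        # Placeholder: a real implementation would emit a schema-only
        # Parquet file under `path` so downstream jobs can read the schema.
        self.wrote_empty_placeholder = True


writer = FakeDataFrameWriter(FakeDataFrame([], schema=["id", "dt"]))
writer.parquet("/tmp/out", partitionBy=["dt"])
print(writer.wrote_empty_placeholder)  # True
```

The key point is that the guard only changes behavior for the empty-plus-partitioned case; all other writes follow the existing code path unchanged.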

Why are the changes needed?

This change addresses the issue reported in SPARK-45908 regarding the handling of empty DataFrames with partitions in PySpark's Parquet writing functionality.

Does this PR introduce any user-facing change?

No

How was this patch tested?

Manually tested.

Was this patch authored or co-authored using generative AI tooling?

No

@ti1uan ti1uan changed the title [Spark 45908][Python] [Spark 45908][Python] Add support for writing empty DataFrames to parquet with partitions Nov 14, 2023
@ti1uan ti1uan changed the title [Spark 45908][Python] Add support for writing empty DataFrames to parquet with partitions [Spark-45908][Python] Add support for writing empty DataFrames to parquet with partitions Nov 14, 2023
@HyukjinKwon HyukjinKwon changed the title [Spark-45908][Python] Add support for writing empty DataFrames to parquet with partitions [SPARK-45908][Python] Add support for writing empty DataFrames to parquet with partitions Nov 15, 2023
Member commented:

Does this work with Scala API too?

Author commented:

Thanks for your review. No, this change only applies to PySpark; the Scala API is unchanged. Do you think we should support it in the Scala API as well?

Member commented:

The change has to be made for API parity.

Author commented:

I will convert this PR to a draft and work on the changes for the Scala API.

Author commented:

Hi @HyukjinKwon, I'm under the impression that PySpark relies on the underlying Scala API for Parquet operations. If that's correct, would updating the Scala API alone be sufficient to introduce this behavior to both PySpark and the Scala API?
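For context on why a Scala-side fix could surface in PySpark automatically: PySpark's DataFrameWriter methods are generally thin wrappers that delegate to a JVM-side writer via Py4J. The snippet below is a simplified illustration of that delegation pattern; JvmWriterStub and PyWriter are stand-ins written for this sketch, and the attribute name `_jwrite` mirrors but does not reproduce PySpark's internals:

```python
# Simplified illustration of the Python-to-JVM delegation pattern.
# JvmWriterStub stands in for the Py4J proxy object that the real
# PySpark DataFrameWriter would hold.

class JvmWriterStub:
    def __init__(self):
        self.calls = []

    def parquet(self, path):
        # In real PySpark, this call crosses into the Scala
        # DataFrameWriter, which performs the actual write.
        self.calls.append(("parquet", path))


class PyWriter:
    def __init__(self, jwrite):
        self._jwrite = jwrite

    def parquet(self, path):
        # The Python layer is a thin wrapper: a behavior change made
        # on the Scala side would surface here with no Python change.
        self._jwrite.parquet(path)


jvm = JvmWriterStub()
PyWriter(jvm).parquet("/tmp/out")
print(jvm.calls)  # one recorded delegated call
```

Under this pattern, fixing the empty-DataFrame behavior in the Scala DataFrameWriter would indeed cover both APIs at once, which is consistent with the parity concern raised above.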

@ti1uan ti1uan marked this pull request as draft November 15, 2023 06:34
@github-actions

We're closing this PR because it hasn't been updated in a while. This isn't a judgement on the merit of the PR in any way. It's just a way of keeping the PR queue manageable.
If you'd like to revive this PR, please reopen it and ask a committer to remove the Stale tag!

@github-actions github-actions bot added the Stale label Feb 24, 2024
@github-actions github-actions bot closed this Feb 25, 2024
