[SPARK-40580][PS][DOCS] Update the document for `DataFrame.to_orc`. #38018

itholic · 2022-09-27T08:17:27Z

What changes were proposed in this pull request?

This PR proposes to update the docstring of DataFrame.to_orc. pandas.DataFrame.to_orc is supported as of pandas 1.5.0, but its behavior differs slightly from the pandas API on Spark version.

Why are the changes needed?

As of pandas 1.5.0, pandas supports writing a DataFrame to ORC files.

The pandas API on Spark already supports this feature, but its behavior is different.

So we should mention the difference in the documentation.
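
As a rough illustration of the difference being documented, the sketch below contrasts the two calls. It assumes pandas 1.5.0 or later with pyarrow installed, a working local Spark installation, and a path on the local filesystem. The function names are hypothetical helpers, not part of either library.

```python
import os


def write_with_pandas():
    """pandas >= 1.5.0: `path` is optional; with no path, ORC bytes are returned."""
    import pandas as pd  # assumption: pandas >= 1.5.0 with pyarrow available

    pdf = pd.DataFrame({"id": [1, 2, 3], "name": ["a", "b", "c"]})
    return pdf.to_orc()  # no path given, so a bytes object comes back


def write_with_pandas_on_spark(path):
    """pandas API on Spark: `path` is required and names an output directory."""
    import pyspark.pandas as ps  # assumption: Spark runs locally

    psdf = ps.DataFrame({"id": [1, 2, 3], "name": ["a", "b", "c"]})
    psdf.to_orc(path)  # returns None; Spark writes part files under `path`
    # Listing works here only because `path` is assumed to be a local directory.
    return sorted(os.listdir(path))
```

Calling write_with_pandas() should return a bytes object, while write_with_pandas_on_spark(path) should list the files Spark created under path. The exact part file names vary from run to run.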

Does this PR introduce any user-facing change?

Yes, documentation update.

How was this patch tested?

The existing doctests should pass.

itholic · 2022-09-27T08:18:47Z

python/pyspark/pandas/frame.py

@@ -5266,12 +5266,12 @@ def to_orc(
        **options: "OptionalPrimitiveType",
    ) -> None:
        """
-        Write the DataFrame out as a ORC file or directory.
+        Write a DataFrame to the ORC format.


FYI, this is changed just to match the pandas documentation.

zhengruifeng

LGTM
BTW, are there similar differences in other methods like to_parquet?

itholic · 2022-09-27T08:54:10Z

Yeah, I think we should also address the other I/O functions if there are behavior differences.

We already document the difference for almost all I/O functions, but it seems some docs are still missing.

Let me file a separate ticket to address them in a single PR.

HyukjinKwon · 2022-09-27T09:18:57Z

python/pyspark/pandas/frame.py


        Parameters
        ----------
        path : str, required
-            Path to write to.
+            Path to write to. It's required in pandas API on Spark whereas optional in pandas.


It says path : str, required. Should we mention that it's required here? It looks duplicated.

python/pyspark/pandas/frame.py

bjornjorgensen · 2022-09-27T10:27:08Z

This is not the same.
pandas API on Spark
or
pandas-on-Spark

Which one do we use?

xinrong-meng · 2022-09-27T21:59:46Z

pandas-on-Spark is more likely to be a developers' reference in the source code, whereas pandas API on Spark is the official, user-facing name. Hope that helps :) @bjornjorgensen

itholic · 2022-09-28T23:59:59Z

Yeah, and more specifically, "pandas-on-Spark" is used when another noun follows right after "pandas API on Spark".

For example,

"pandas API on Spark DataFrame is distributed"

When reading the above sentence, readers may be unsure which of the following is meant:

Does this mean pandas API on "Spark DataFrame"?
Does this mean "pandas API on Spark" DataFrame?

So we use "pandas-on-Spark DataFrame" in this case to make it clearer.

HyukjinKwon · 2022-09-29T00:34:44Z

Merged to master.

[SPARK-40580] Update the document for DataFrame.to_orc

76ec4a0

itholic changed the title ~~[SPARK-40580] Update the document for DataFrame.to_orc.~~ [SPARK-40580][PS] Update the document for DataFrame.to_orc. Sep 27, 2022

itholic changed the title ~~[SPARK-40580][PS] Update the document for DataFrame.to_orc.~~ [SPARK-40580][PS][DOCS] Update the document for DataFrame.to_orc. Sep 27, 2022

github-actions bot added CORE PANDAS API ON SPARK PYTHON labels Sep 27, 2022

zhengruifeng approved these changes Sep 27, 2022

HyukjinKwon reviewed Sep 27, 2022

xinrong-meng approved these changes Sep 27, 2022

itholic and others added 2 commits September 29, 2022 09:03

resolved the comments

490930b

Update python/pyspark/pandas/frame.py

7829806

HyukjinKwon closed this in 31aadc4 Sep 29, 2022

itholic deleted the SPARK-40580 branch April 22, 2023 05:46
