[SPARK-40580][PS][DOCS] Update the document for DataFrame.to_orc. #38018

Closed
wants to merge 3 commits into from

Contributor

@itholic itholic commented Sep 27, 2022

What changes were proposed in this pull request?

This PR proposes to update the docstring of DataFrame.to_orc, because pandas.DataFrame.to_orc is supported as of pandas 1.5.0 but behaves slightly differently from the pandas API on Spark version.

Why are the changes needed?

As of pandas 1.5.0, pandas supports writing a DataFrame to ORC files.

pandas API on Spark already supports this feature, but its behavior is different.

So, we should mention the difference in the documentation.
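
To make the difference concrete, here is a minimal sketch (not part of this patch). It assumes pandas >= 1.5.0 with pyarrow installed and a working local Spark setup; the helper names and the `/tmp/to_orc_demo` path are hypothetical and only there for illustration.

```python
def orc_bytes_with_pandas(pdf):
    # pandas >= 1.5.0: `path` is optional. When it is omitted,
    # to_orc returns the serialized ORC content as bytes.
    return pdf.to_orc()


def orc_files_with_pandas_on_spark(psdf, path):
    # pandas API on Spark: `path` is required. Spark writes a directory
    # of part files at `path`, and the call returns None.
    psdf.to_orc(path, mode="overwrite")


def demo(path="/tmp/to_orc_demo"):  # hypothetical output location
    # Imports are kept local so the helpers above can be read (and tested)
    # without pandas or Spark being installed.
    import pandas as pd
    import pyspark.pandas as ps

    pdf = pd.DataFrame({"id": [1, 2, 3], "value": [0.1, 0.2, 0.3]})

    raw = orc_bytes_with_pandas(pdf)  # in-memory bytes, nothing written to disk
    orc_files_with_pandas_on_spark(ps.from_pandas(pdf), path)

    # Read the directory back as a pandas-on-Spark DataFrame.
    return raw, ps.read_orc(path)
```

Calling `demo()` should return the ORC bytes produced by pandas together with a pandas-on-Spark DataFrame read back from the written directory, which is the behavior gap the docstring update is meant to call out.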

Does this PR introduce any user-facing change?

Yes, documentation update.

How was this patch tested?

The existing doctest should pass.

@itholic itholic changed the title [SPARK-40580] Update the document for DataFrame.to_orc. [SPARK-40580][PS] Update the document for DataFrame.to_orc. Sep 27, 2022
@itholic itholic changed the title [SPARK-40580][PS] Update the document for DataFrame.to_orc. [SPARK-40580][PS][DOCS] Update the document for DataFrame.to_orc. Sep 27, 2022
@@ -5266,12 +5266,12 @@ def to_orc(
**options: "OptionalPrimitiveType",
) -> None:
"""
- Write the DataFrame out as a ORC file or directory.
+ Write a DataFrame to the ORC format.
Contributor Author

FYI, this was changed just to match the pandas documentation.

Contributor

@zhengruifeng zhengruifeng left a comment

LGTM
BTW, is there a similar difference in other methods like to_parquet?

@itholic
Contributor Author

itholic commented Sep 27, 2022

Yeah, I think we should also address the other I/O functions if there are behavior differences.

We already document the differences for almost all I/O functions, but it seems some docs are still missing.

Let me file a separate ticket to address them in one PR.


Parameters
----------
path : str, required
- Path to write to.
+ Path to write to. It's required in pandas API on Spark whereas optional in pandas.
Member

It already says path : str, required. Should we mention that it's required here as well? It looks duplicated.

@bjornjorgensen
Contributor

This is not the same.
pandas API on Spark
or
pandas-on-Spark

Which one do we use?

@xinrong-meng
Member

pandas-on-Spark is more likely to be a developers' reference in the source code, whereas pandas API on Spark is the official, user-facing name. Hope that helps :) @bjornjorgensen

@itholic
Contributor Author

itholic commented Sep 28, 2022

Yeah, and more specifically, "pandas-on-Spark" is used when another noun would follow right after "pandas API on Spark".

For example,

"pandas API on Spark DataFrame is distributed"

When reading the above sentence, readers may be unsure which of these is meant:

  • does it mean pandas API on "Spark DataFrame"?
  • does it mean "pandas API on Spark" DataFrame?

So, we use "pandas-on-Spark DataFrame" in this case to make it clearer.

@HyukjinKwon
Member

Merged to master.

@itholic itholic deleted the SPARK-40580 branch April 22, 2023 05:46