[SPARK-40580][PS][DOCS] Update the document for DataFrame.to_orc #38018
@@ -5266,12 +5266,12 @@ def to_orc(
 **options: "OptionalPrimitiveType",
 ) -> None:
 """
-Write the DataFrame out as a ORC file or directory.
+Write a DataFrame to the ORC format.
LGTM
BTW, is there a similar difference in other methods like to_parquet?
Yeah, I think we should also address the other I/O functions if there are behavior differences. We already document the difference for almost all I/O functions, but it seems some docs are still missing. Let me file a separate ticket to address them in one PR.
python/pyspark/pandas/frame.py
 Parameters
 ----------
 path : str, required
-Path to write to.
+Path to write to. It's required in pandas API on Spark whereas optional in pandas.
It says path : str, required. Should we mention that it's required here? It looks duplicated.
This is not the same. Which one do we use?
pandas-on-Spark is more likely to be a developers' reference in the source code, whereas
Yeah, and more specifically, "pandas-on-Spark" is used when another noun follows right after "pandas API on Spark". For example: "pandas API on Spark DataFrame is distributed". When reading the above sentence, readers may be confused between:
So, we use "pandas-on-Spark DataFrame" in this case to make it clearer.
Merged to master.
What changes were proposed in this pull request?
This PR proposes to update the docstring of DataFrame.to_orc, since pandas.DataFrame.to_orc is supported from pandas 1.5.0, but the behavior is a bit different.
Why are the changes needed?
As of pandas 1.5.0, pandas supports writing a DataFrame to ORC files.
pandas API on Spark already supports this feature, but the behavior is different.
So, we should mention the difference in the documentation.
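To make the difference concrete, here is a minimal sketch contrasting the two calls. The helper names write_orc_on_spark and write_orc_with_pandas are hypothetical and exist only for illustration. The sketch assumes a working PySpark installation for the first helper, and pandas 1.5.0 or later with pyarrow for the second. The imports are done lazily inside each function so that the definitions load even where one of the libraries is missing.

```python
def write_orc_on_spark(path: str) -> None:
    """Hypothetical helper: write a small pandas-on-Spark DataFrame as ORC."""
    # Lazy import: requires PySpark (and a JVM) only when the function is called.
    import pyspark.pandas as ps

    psdf = ps.DataFrame({"id": [1, 2, 3], "value": ["a", "b", "c"]})
    # In pandas API on Spark, `path` is required. The output is expected to be
    # a directory holding one or more part files, not a single file.
    psdf.to_orc(path, mode="overwrite")


def write_orc_with_pandas(path=None):
    """Hypothetical helper: write the same data with pandas >= 1.5.0."""
    # Lazy import: pandas needs pyarrow installed for ORC support.
    import pandas as pd

    pdf = pd.DataFrame({"id": [1, 2, 3], "value": ["a", "b", "c"]})
    # In pandas, `path` is optional. With path=None the ORC content is
    # returned as bytes instead of being written to disk.
    return pdf.to_orc(path)
```

Calling write_orc_on_spark("/tmp/example_orc") (an arbitrary example path) should leave a directory at that location, which can be read back with pyspark.pandas.read_orc, while write_orc_with_pandas() with no argument should return a bytes object.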
Does this PR introduce any user-facing change?
Yes, documentation update.
How was this patch tested?
The existing doctest should pass.