[SPARK-40006][PYTHON][DOCS] Make pyspark.sql.group examples self-contained #37437

HyukjinKwon · 2022-08-08T12:23:24Z

What changes were proposed in this pull request?

This PR proposes to improve the examples in pyspark.sql.group by making each example self-contained with a brief explanation and a bit more realistic example.

Why are the changes needed?

To make the documentation more readable and to allow the examples to be copied and pasted directly into the PySpark shell.

Does this PR introduce any user-facing change?

Yes, it changes the documentation.

How was this patch tested?

Manually ran each doctest.

Transurgeon · 2022-08-08T14:25:04Z

Good morning Hyukjin, I saw your email from the mailing list and I can try to help.

Does self-contained simply mean that we need to initialise the dataframe in each shell example?
Also what do you mean by more realistic examples?

Just to make sure for building the docs, I need to run these two commands right?

HyukjinKwon · 2022-08-09T02:12:21Z

Hey @Transurgeon, thanks for taking a look.

> Does self-contained simply mean that we need to initialise the dataframe in each shell example?

Yes. Plus, we should add a description for each example. Basically, I would like to follow what pandas does.

> Also what do you mean by more realistic examples?

Something meaningful. The operation has to do something. For example, spark.createDataFrame([1]).count() doesn't make much sense.

> Just to make sure for building the docs, I need to run these two commands right?

Yes. As long as the format is consistent, you might not need to build and validate it yourself, though.
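To illustrate what "self-contained" means in practice, here is a minimal sketch that uses only the standard `doctest` module. It is not taken from this PR: `total_by_key`, `depends_on_outer_state`, and `count_failures` are hypothetical names, and plain Python lists stand in for DataFrames. The first docstring builds its own input, so it passes when run with nothing else in scope; the second relies on a variable defined elsewhere and fails.

```python
import doctest


def total_by_key(pairs):
    """Sum values per key (a plain-Python stand-in for a grouped aggregation).

    Build the input inside the example so it can be pasted into a fresh shell.

    >>> sales = [("Java", 20000), ("dotNET", 15000), ("Java", 30000)]
    >>> total_by_key(sales)
    {'Java': 50000, 'dotNET': 15000}
    """
    totals = {}
    for key, value in pairs:
        totals[key] = totals.get(key, 0) + value
    return totals


def depends_on_outer_state(pairs):
    """Same operation, but the example relies on a name defined elsewhere.

    >>> total_by_key(sales_defined_elsewhere)
    {'Java': 50000, 'dotNET': 15000}
    """
    return total_by_key(pairs)


def count_failures(func):
    """Run func's docstring examples with only total_by_key in scope."""
    # A fresh, minimal namespace: anything else the example needs
    # must be created by the example itself.
    globs = {"total_by_key": total_by_key}
    test = doctest.DocTestParser().get_doctest(
        func.__doc__, globs, func.__name__, None, 0
    )
    runner = doctest.DocTestRunner(verbose=False)
    # Swallow the failure report; only the count matters here.
    result = runner.run(test, out=lambda text: None)
    return result.failed
```

Calling `count_failures(total_by_key)` reports no failures, while `count_failures(depends_on_outer_state)` reports one, because `sales_defined_elsewhere` is never created inside the example.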

HyukjinKwon · 2022-08-09T02:13:04Z

cc @zero323 @ueshin @viirya @zhengruifeng @xinrong-meng WDYT?

viirya

Looks good to me.

viirya

It looks good to have them serve as examples and tests at the same time.

HyukjinKwon · 2022-08-09T03:12:18Z

Thanks all.

Merged to master.

zhengruifeng · 2022-08-09T03:23:03Z

Late LGTM + 1

Yikun · 2022-08-09T12:48:37Z

python/pyspark/sql/group.py

-        >>> df5.groupBy("sales.year").pivot("sales.course").sum("sales.earnings").collect()
-        [Row(year=2012, Java=20000, dotNET=15000), Row(year=2013, Java=30000, dotNET=48000)]
+        >>> from pyspark.sql import Row
+        >>> spark = SparkSession.builder.master("local[4]").appName("sql.group tests").getOrCreate()


Sorry for the late review. Just curious: why not use spark directly here? I think this example is a little different from the others.

For all PRs in this series, I think sc (Spark context) and spark (SparkSession) can be defined at the bottom and used directly in every doctest (just like the pyspark shell, where sc and spark are already available), right?

HyukjinKwon · 2022-08-09

Oh yeah, we should remove this. It was my mistake.

…d duplicated Spark session initialization

### What changes were proposed in this pull request?

This PR is a followup of #37437, which did not remove the unused `sc` and the duplicated Spark session initialization.

### Why are the changes needed?

To make the examples consistent and remove unused variables.

### Does this PR introduce _any_ user-facing change?

No. It's a documentation change. However, the previous PR is not released yet.

### How was this patch tested?

CI in this PR should test it out.

Closes #37457 from HyukjinKwon/SPARK-40006-followup.

Authored-by: Hyukjin Kwon <gurwls223@apache.org>
Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>

Self-contained examples with parameter descriptions in PySpark documentation

855fe13

github-actions bot added CORE PYTHON SQL labels Aug 8, 2022

viirya approved these changes Aug 9, 2022

viirya reviewed Aug 9, 2022

HyukjinKwon closed this in 527cce5 Aug 9, 2022

Yikun reviewed Aug 9, 2022

HyukjinKwon mentioned this pull request Aug 10, 2022

[SPARK-40006][PYTHON][DOCS][FOLLOW-UP] Remove unused Spark context and duplicated Spark session initialization #37457

Closed

HyukjinKwon deleted the SPARK-40006 branch January 15, 2024 00:49
