[SPARK-45220][PYTHON][DOCS] Refine docstring of DataFrame.join by allisonwang-db · Pull Request #43039 · apache/spark

allisonwang-db · 2023-09-21T21:26:07Z

What changes were proposed in this pull request?

This PR refines the docstring of DataFrame.join by adding more examples and explanations.

Why are the changes needed?

To improve PySpark documentation.

Does this PR introduce any user-facing change?

No

How was this patch tested?

doctest

Was this patch authored or co-authored using generative AI tooling?

No

allisonwang-db · 2023-09-21T21:26:31Z

python/pyspark/sql/dataframe.py

cc @cloud-fan please let me know if this makes sense to you

It makes sense, but we should probably use a self-join in the example, so that people can understand why it's better.

cloud-fan · 2023-09-22T00:56:10Z

python/pyspark/sql/dataframe.py

does it output name, age columns from the left table?

Yup. I can add another example to make it more clear.

cloud-fan · 2023-09-22T00:57:50Z

python/pyspark/sql/dataframe.py

Do we really need to mention it? A cross join is the same as an inner join without a join condition. We added cross join because we wanted to forbid inner joins without a join condition by default, but that restriction has already been removed.

Probably not, as it's not really widely used. I will remove this example.
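
The equivalence mentioned in this thread can be illustrated with a small plain-Python model. This is only a sketch of the join semantics, not Spark code: the sample rows and the helper names `inner_join` and `cross_join` are hypothetical and do not come from the PR.

```python
from itertools import product

# Hypothetical sample rows standing in for two small DataFrames.
people = [("Alice", 2), ("Bob", 5)]
heights = [("Tom", 80), ("Bob", 85)]


def inner_join(left, right, condition=lambda l, r: True):
    """Pair each left row with each right row that satisfies `condition`.

    With the default always-true condition, no pair is filtered out.
    """
    return [l + r for l, r in product(left, right) if condition(l, r)]


def cross_join(left, right):
    """Cartesian product: every left row paired with every right row."""
    return [l + r for l, r in product(left, right)]


# An inner join without a condition yields the same rows as a cross join.
unconditioned = inner_join(people, heights)
crossed = cross_join(people, heights)
print(unconditioned == crossed)  # True
print(len(crossed))              # 4

# A condition on the name field narrows the result to the matching pair.
matched = inner_join(people, heights, lambda l, r: l[0] == r[0])
print(matched)                   # [('Bob', 5, 'Bob', 85)]
```

In this model the only difference between the two joins is the filter, which is consistent with the reviewer's point that a separate cross join example adds little.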

srowen

Maybe just rerun tests

allisonwang-db · 2023-10-09T20:32:49Z

cc @cloud-fan @HyukjinKwon the test failure seems unrelated.

holdenk · 2023-10-10T00:32:40Z

python/pyspark/sql/dataframe.py

        Examples
        --------
-        The following performs a full outer join between ``df1`` and ``df2``.
+        The following examples demonstrate various join types between ``df1`` and ``df2``.


nit: it's not just df1 and df2 (no need to fix unless other changes)

I am neutral on this change here.

Overall, LGTM.

Good catch!

beliefer · 2023-10-10T01:59:25Z

python/pyspark/sql/dataframe.py

        Examples
        --------
-        The following performs a full outer join between ``df1`` and ``df2``.
+        The following examples demonstrate various join types between ``df1`` and ``df2``.


I am neutral on this change here.

Overall, LGTM.

allisonwang-db · 2023-10-11T03:21:41Z

python/pyspark/sql/dataframe.py

+        they will appear with `NULL` in the `name` column of `df`, and vice versa for `df2`.
+
+        >>> joined = df.join(df2, df.name == df2.name, "outer").sort(sf.desc(df.name))
+        >>> joined.show()


It looks like this example does not work with Spark Connect:

pyspark.errors.exceptions.connect.AnalysisException: [UNRESOLVED_COLUMN.WITH_SUGGESTION] A column, variable, or function parameter with name `name` cannot be resolved. Did you mean one of the following? [`name`, `name`, `age`, `height`].;
'Sort ['name DESC NULLS LAST], true
+- Join FullOuter, (name#64 = name#78)
   :- LocalRelation [name#64, age#65L]
   +- LocalRelation [name#78, height#79L]

cc @zhengruifeng @HyukjinKwon

Did we implement this join API differently in Spark Connect?

It should work, since a column in Spark Connect carries the DataFrame ID. @zhengruifeng can you take a look?

Can we exclude those examples from this PR, and would you mind filing JIRAs for both issues, @allisonwang-db?

Sounds good. Created SPARK-45509
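
For readers unfamiliar with the behaviour the docstring line in this thread describes, the following plain-Python sketch models a full outer join on `name`, with `None` standing in for `NULL`. It is an illustration of the semantics only, not Spark or Spark Connect code. The sample rows and the helper `full_outer_join` are hypothetical; only the column names (`name`, `age`, `height`) are taken from the error message above.

```python
# Hypothetical rows: df has (name, age), df2 has (name, height).
df_rows = [("Alice", 2), ("Bob", 5)]
df2_rows = [("Tom", 80), ("Bob", 85)]


def full_outer_join(left, right):
    """Model a full outer join on the first field of each row.

    Rows without a match on the other side are kept, and the missing
    side is filled with None (the stand-in for NULL).
    """
    joined = []
    matched_right = set()
    for l_name, age in left:
        hits = [(i, r) for i, r in enumerate(right) if r[0] == l_name]
        if not hits:
            # Left row with no partner: right-hand columns become None.
            joined.append((l_name, age, None, None))
        for i, (r_name, height) in hits:
            matched_right.add(i)
            joined.append((l_name, age, r_name, height))
    for i, (r_name, height) in enumerate(right):
        if i not in matched_right:
            # Right row with no partner: left-hand columns become None.
            joined.append((None, None, r_name, height))
    return joined


result = full_outer_join(df_rows, df2_rows)
for row in result:
    print(row)
```

Each output row carries two `name` fields, one per side. That is why a bare reference to `name` after the join has two candidates, as the suggestion list in the error message shows.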

allisonwang-db · 2023-10-11T03:23:33Z

python/pyspark/sql/dataframe.py

+        >>> df.join(df, df.name == df.name, "outer").select(df.name).show() # doctest: +SKIP
+        Traceback (most recent call last):
+        ...
+        pyspark.errors.exceptions.captured.AnalysisException: Column name#0 are ambiguous...


allisonwang-db · 2023-10-18T21:19:30Z

friendly ping @HyukjinKwon @zhengruifeng the test failures are not related.

HyukjinKwon · 2023-10-19T05:25:23Z

Merged to master.

github-actions bot added SQL PYTHON labels Sep 21, 2023

allisonwang-db commented Sep 21, 2023

cloud-fan reviewed Sep 22, 2023

allisonwang-db added 2 commits October 5, 2023 15:07

refine

5c29136

address comments

863af87

allisonwang-db force-pushed the spark-45220-refine-join branch from 8631fa8 to 863af87 Compare October 5, 2023 23:22

cloud-fan approved these changes Oct 6, 2023

srowen approved these changes Oct 6, 2023

retrigger build

534a190

holdenk approved these changes Oct 10, 2023

beliefer approved these changes Oct 10, 2023

allisonwang-db commented Oct 11, 2023

address comments

a903c0f

HyukjinKwon approved these changes Oct 12, 2023

HyukjinKwon closed this in db16236 Oct 19, 2023

dongjoon-hyun mentioned this pull request May 6, 2024

[SPARK-45220][FOLLOWUP][DOCS][TESTS] Make a dataframe.join doctest deterministic #46398

Closed
