Skip to content

[SPARK-45220][PYTHON][DOCS] Refine docstring of DataFrame.join#43039

Closed
allisonwang-db wants to merge 4 commits intoapache:masterfrom
allisonwang-db:spark-45220-refine-join
Closed

[SPARK-45220][PYTHON][DOCS] Refine docstring of DataFrame.join#43039
allisonwang-db wants to merge 4 commits intoapache:masterfrom
allisonwang-db:spark-45220-refine-join

Conversation

@allisonwang-db
Copy link
Contributor

What changes were proposed in this pull request?

This PR refines the docstring of DataFrame.join by adding more examples and explanations.

Why are the changes needed?

To improve PySpark documentation.

Does this PR introduce any user-facing change?

No

How was this patch tested?

doctest

Was this patch authored or co-authored using generative AI tooling?

No

Comment on lines 2708 to 2713
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

cc @cloud-fan please let me know if this makes sense to you

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It makes sense, but we should probably use self-join in the example, so that people can understand why it's better

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

does it output name, age columns from the left table?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yup. I can add another example to make it more clear.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

do we really need to mention it? It's the same as inner join. We added cross join as we want to forbid inner join without join condition by default, but this restriction has been removed already.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Probably not, as it's not really widely used. I will remove this example.

@allisonwang-db allisonwang-db force-pushed the spark-45220-refine-join branch from 8631fa8 to 863af87 Compare October 5, 2023 23:22
Copy link
Member

@srowen srowen left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe just rerun tests

@allisonwang-db
Copy link
Contributor Author

cc @cloud-fan @HyukjinKwon the test failure seems unrelated.

Examples
--------
The following performs a full outer join between ``df1`` and ``df2``.
The following examples demonstrate various join types between ``df1`` and ``df2``.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: it's not just df1 and df2 (no need to fix unless other changes)

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I am neutral on this change here.

Overall, LGTM.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good catch!

Examples
--------
The following performs a full outer join between ``df1`` and ``df2``.
The following examples demonstrate various join types between ``df1`` and ``df2``.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I am neutral on this change here.

Overall, LGTM.

they will appear with `NULL` in the `name` column of `df`, and vice versa for `df2`.

>>> joined = df.join(df2, df.name == df2.name, "outer").sort(sf.desc(df.name))
>>> joined.show()
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It looks like this example does not work with spark connect:

pyspark.errors.exceptions.connect.AnalysisException: [UNRESOLVED_COLUMN.WITH_SUGGESTION] A column, variable, or function parameter with name `name` cannot be resolved. Did you mean one of the following? [`name`, `name`, `age`, `height`].;
'Sort ['name DESC NULLS LAST], true
+- Join FullOuter, (name#64 = name#78)
   :- LocalRelation [name#64, age#65L]
   +- LocalRelation [name#78, height#79L]

cc @zhengruifeng @HyukjinKwon

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

did we implement this join API differently in Spark Connect?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It should work as the column in spark connect contain dataframe id. @zhengruifeng can you take a look?

Copy link
Member

@HyukjinKwon HyukjinKwon Oct 12, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can we exclude those examples in this PR, and mind filing JIRAs for both issues @allisonwang-db?

Copy link
Contributor Author

@allisonwang-db allisonwang-db Oct 12, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sounds good. Created SPARK-45509

Comment on lines +2709 to +2712
>>> df.join(df, df.name == df.name, "outer").select(df.name).show() # doctest: +SKIP
Traceback (most recent call last):
...
pyspark.errors.exceptions.captured.AnalysisException: Column name#0 are ambiguous...
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

However, this works in Spark connect!

>>> df.join(df, df.name == df.name, "outer").select(df.name).show()
+-----+
| name|
+-----+
|Alice|
|Alice|
|  Bob|
|  Bob|
+-----+

also cc @cloud-fan

@allisonwang-db
Copy link
Contributor Author

friendly ping @HyukjinKwon @zhengruifeng the test failures are not related.

@HyukjinKwon
Copy link
Member

Merged to master.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

6 participants