[SPARK-45220][PYTHON][DOCS] Refine docstring of DataFrame.join #43039
allisonwang-db wants to merge 4 commits into apache:master
python/pyspark/sql/dataframe.py
Outdated
cc @cloud-fan please let me know if this makes sense to you
It makes sense, but we should probably use self-join in the example, so that people can understand why it's better
python/pyspark/sql/dataframe.py
Outdated
does it output name, age columns from the left table?
Yup. I can add another example to make it more clear.
python/pyspark/sql/dataframe.py
Outdated
do we really need to mention it? It's the same as inner join. We added cross join as we want to forbid inner join without join condition by default, but this restriction has been removed already.
Probably not, as it's not really widely used. I will remove this example.
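For readers following this thread, here is a minimal plain-Python sketch (not Spark code) of the point made above: a cross join behaves like an inner join that has no join condition. The sample rows and the `inner_join` helper are hypothetical and exist only for illustration.

```python
from itertools import product

# Hypothetical sample rows; plain tuples stand in for DataFrame rows.
people = [("Alice", 2), ("Bob", 5)]
heights = [("Tom", 80), ("Bob", 85)]


def inner_join(left, right, condition=None):
    """Nested-loop inner join; with no condition every pair is kept."""
    return [
        (l, r)
        for l in left
        for r in right
        if condition is None or condition(l, r)
    ]


# A cross join is the Cartesian product of the two inputs.
cross = list(product(people, heights))

# An inner join without a condition yields exactly the same pairs.
unconditioned = inner_join(people, heights)

# Adding a condition (equal names) narrows the result to matching pairs.
by_name = inner_join(people, heights, lambda l, r: l[0] == r[0])
```

Since the two results coincide, a separate cross join example adds little to the docstring, which matches the decision to drop it.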
8631fa8 to
863af87
cc @cloud-fan @HyukjinKwon the test failure seems unrelated.
python/pyspark/sql/dataframe.py
Outdated
Examples
--------
The following performs a full outer join between ``df1`` and ``df2``.
The following examples demonstrate various join types between ``df1`` and ``df2``.
nit: it's not just df1 and df2 (no need to fix unless other changes)
I am neutral on this change here.
Overall, LGTM.
python/pyspark/sql/dataframe.py
Outdated
they will appear with `NULL` in the `name` column of `df`, and vice versa for `df2`.

>>> joined = df.join(df2, df.name == df2.name, "outer").sort(sf.desc(df.name))
>>> joined.show()
It looks like this example does not work with Spark Connect:
pyspark.errors.exceptions.connect.AnalysisException: [UNRESOLVED_COLUMN.WITH_SUGGESTION] A column, variable, or function parameter with name `name` cannot be resolved. Did you mean one of the following? [`name`, `name`, `age`, `height`].;
'Sort ['name DESC NULLS LAST], true
+- Join FullOuter, (name#64 = name#78)
:- LocalRelation [name#64, age#65L]
+- LocalRelation [name#78, height#79L]
did we implement this join API differently in Spark Connect?
It should work, since a column in Spark Connect carries its DataFrame ID. @zhengruifeng can you take a look?
Can we exclude those examples in this PR, and mind filing JIRAs for both issues @allisonwang-db?
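As a reference for the docstring wording discussed in this thread, below is a minimal plain-Python sketch (not Spark code) of full outer join semantics: rows without a match on the other side are kept and padded with `None`, which stands in for SQL `NULL`. The `full_outer_join` helper and the sample rows are hypothetical; the column layout (`name`, `age`) and (`name`, `height`) follows the plan shown in the error message above.

```python
def full_outer_join(left, right):
    """Full outer join on name.

    left:  list of (name, age) tuples
    right: list of (name, height) tuples
    Returns (left_name, age, right_name, height) tuples.
    """
    rows = []
    matched_right = set()
    for l_name, age in left:
        found = False
        for i, (r_name, height) in enumerate(right):
            # As in SQL, a NULL (None) key never matches anything.
            if l_name is not None and l_name == r_name:
                rows.append((l_name, age, r_name, height))
                matched_right.add(i)
                found = True
        if not found:
            # Left row with no partner: right-side columns become None.
            rows.append((l_name, age, None, None))
    for i, (r_name, height) in enumerate(right):
        if i not in matched_right:
            # Right row with no partner: left-side columns become None.
            rows.append((None, None, r_name, height))
    return rows


# Hypothetical sample data.
people = [("Alice", 2), ("Bob", 5)]
heights = [("Tom", 80), ("Bob", 85)]
joined = full_outer_join(people, heights)
```

Here `Alice` appears with `None` on the right side and `Tom` with `None` on the left side, while `Bob` matches on both. This sketch says nothing about how Spark or Spark Connect resolves `df.name` after the join, which is the open question in this thread.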
>>> df.join(df, df.name == df.name, "outer").select(df.name).show()  # doctest: +SKIP
Traceback (most recent call last):
...
pyspark.errors.exceptions.captured.AnalysisException: Column name#0 are ambiguous...
However, this works in Spark connect!
>>> df.join(df, df.name == df.name, "outer").select(df.name).show()
+-----+
| name|
+-----+
|Alice|
|Alice|
| Bob|
| Bob|
+-----+
also cc @cloud-fan
friendly ping @HyukjinKwon @zhengruifeng the test failures are not related.
Merged to master.
What changes were proposed in this pull request?
This PR refines the docstring of `DataFrame.join` by adding more examples and explanations.
Why are the changes needed?
To improve PySpark documentation.
Does this PR introduce any user-facing change?
No
How was this patch tested?
doctest
Was this patch authored or co-authored using generative AI tooling?
No