[SPARK-37930][PYTHON] Fix DataFrame select subset with duplicated columns #35240
dchvn wants to merge 3 commits into apache:master
Conversation
CC @Yikun @HyukjinKwon @ueshin @itholic Please take a look when you find some time, thanks!
```python
# SPARK-37930: Fix DataFrame select subset with duplicated columns
# Remove duplicated columns before selecting `data_columns`
pdf = pdf.loc[:, ~pdf.columns.duplicated()][data_columns]
```
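A plain-pandas sketch of what the patched line does, with a tiny stand-in DataFrame (the `pdf` and `data_columns` names follow the snippet above; the sample data is made up for illustration):

```python
import pandas as pd

# Build a DataFrame with a duplicated column label "a".
pdf = pd.DataFrame([[1, 3, 5], [2, 4, 6]])
pdf.columns = ["a", "a", "b"]

# ~pdf.columns.duplicated() keeps only the first occurrence of each label,
# so the subsequent label-based select no longer hits the ambiguity.
deduped = pdf.loc[:, ~pdf.columns.duplicated()]
print(list(deduped.columns))  # ['a', 'b']

data_columns = ["a", "b"]
subset = deduped[data_columns]
print(subset.shape)  # (2, 2)
```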
ohh, sorry, I missed https://spark.apache.org/docs/latest/api/python/user_guide/pandas_on_spark/best_practices.html?highlight=duplicate#do-not-use-duplicated-column-names. Actually, pandas API on Spark does not officially support duplicated column names for now.
But I think we can fix this since the fix seems minimal. BTW, can we do `pdf[~pdf.columns.duplicated()][data_columns]`?
> can we do `pdf[~pdf.columns.duplicated()][data_columns]`?

I think `pdf[~pdf.columns.duplicated()]` is used to select rows. Do you mind if I keep using `.loc`?
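A quick plain-pandas check of this point (sample data is made up for illustration): plain `[]` with a boolean array filters rows, so a column-length mask is the wrong shape, while `.loc[:, mask]` selects along the column axis.

```python
import pandas as pd

pdf = pd.DataFrame({"x": [1, 2], "y": [3, 4], "z": [5, 6]})
pdf.columns = ["a", "a", "b"]

mask = ~pdf.columns.duplicated()  # length 3 == number of columns

try:
    pdf[mask]  # treated as a row filter; mask length 3 != 2 rows
except ValueError as e:
    print("row-wise indexing failed:", e)

# .loc[:, mask] selects columns, which is what the fix needs
print(list(pdf.loc[:, mask].columns))  # ['a', 'b']
```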
> Actually pandas API on Spark does not officially support duplicated column names for now.

Learned. Should we change the docs in this PR?
```python
import pyspark.pandas as ps

psdf = ps.DataFrame({'a': [1, 2], 'b': [3, 4]})
psdf.columns = ["a", "a"]
print(psdf)
```
FYI, this still raises the exception `pyspark.sql.utils.AnalysisException: Reference 'a' is ambiguous, could be: a, a.`. So it seems that the document is still valid.
And sorry to say this late, but I think this PR is only for the self-combine case, right? So another option for the original PR: since we have made it clear in the document that duplicated column names are not supported, should we directly throw an exception on self combine? Let's see other ideas.
@Yikun Thanks for the clarification. I prefer changing the behavior to follow pandas, but I don't mind if we decide to keep this difference.
Maybe throwing an exception is an idea too. Actually, we have a lot of places like this, not only here, so let's just fix this first and revisit the duplicated names later separately.
Just as a note from the pandas side: the behavior as-is in pandas is correct. Selecting duplicated columns twice should duplicate them again. This is the same as when selecting a unique column twice.
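A plain-pandas illustration of the note above (sample data made up): re-selecting a label duplicates columns again, whether or not the label was unique to begin with.

```python
import pandas as pd

pdf = pd.DataFrame({"a": [1, 2], "b": [3, 4]})

# Selecting a unique column twice yields two "a" columns.
print(pdf[["a", "a"]].shape)  # (2, 2)

dup = pdf[["a", "a"]]          # columns: a, a
# Selecting the duplicated label again doubles it: each "a" in the
# selection list matches both existing "a" columns, giving 4 columns.
print(dup[["a", "a"]].shape)  # (2, 4)
```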
I'm just worried that unexpected errors may occur on the resulting DataFrame after creating duplicated columns, since we don't officially support it. For example:

```python
>>> psdf[['a', 'a', 'a']].a
Traceback (most recent call last):
...
TypeError: to_string() got an unexpected keyword argument 'name'
```

whereas pandas supports it:

```python
>>> pdf[['a', 'a', 'a']].a
          a  a  a
0.480706  1  1  1
0.670385  2  2  2
0.051040  3  3  3
0.815093  4  4  4
0.567992  5  5  5
0.740033  6  6  6
0.646050  7  7  7
0.058224  8  8  8
0.939080  9  9  9
```

So, how about we just raise a proper exception for now, as mentioned at #35240 (comment)?
@itholic Thanks for your investigation. I will raise an exception, continuing from my original PR.
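A minimal sketch of the raise-an-exception option agreed on above; `check_duplicated_columns` is a hypothetical helper for illustration, not the actual code added to Spark.

```python
import pandas as pd

def check_duplicated_columns(columns):
    # Hypothetical guard: reject duplicated column labels up front
    # instead of failing later with an ambiguous-reference error.
    index = pd.Index(columns)
    dup_mask = index.duplicated()
    if dup_mask.any():
        dups = sorted(set(index[dup_mask]))
        raise ValueError("Duplicated column names are not supported: %s" % dups)

check_duplicated_columns(["a", "b"])  # passes silently

try:
    check_duplicated_columns(["a", "a", "b"])
except ValueError as e:
    print(e)  # names the duplicated label "a"
```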
What changes were proposed in this pull request?
Fix selecting a subset with duplicated columns of a ps.DataFrame.
Why are the changes needed?
Currently, when selecting a subset with duplicated columns of a ps.DataFrame, we face an exception.
We should fix it and follow pandas's behavior.
Does this PR introduce any user-facing change?
Yes,
Before this PR
After this PR
How was this patch tested?
Unit tests.