[SPARK-37930][PYTHON] Fix DataFrame select subset with duplicated columns#35240

Closed

dchvn wants to merge 3 commits into apache:master from dchvn:SPARK-37930


Conversation

@dchvn
Contributor

@dchvn dchvn commented Jan 18, 2022

What changes were proposed in this pull request?

Fix selecting a subset with duplicated columns from a ps.DataFrame.

Why are the changes needed?

Currently, selecting a subset with duplicated columns from a ps.DataFrame raises an exception:

ValueError: Length mismatch: Expected axis has 4 elements, new values have 2 elements

We should fix it to follow pandas's behavior.

Does this PR introduce any user-facing change?

Yes,

Before this PR

>>> psdf
   a
0  1
1  2
2  3
3  4
>>> psdf[['a', 'a']]
Traceback (most recent call last):
  ...
  File "/u02/venv3.9-2/lib/python3.9/site-packages/pandas/core/internals/base.py", line 57, in _validate_set_axis
    raise ValueError(
ValueError: Length mismatch: Expected axis has 4 elements, new values have 2 elements

>>> psdf.loc[:, ['a', 'a']]
Traceback (most recent call last):
  ...
  File "/u02/venv3.9-2/lib/python3.9/site-packages/pandas/core/internals/base.py", line 57, in _validate_set_axis
    raise ValueError(
ValueError: Length mismatch: Expected axis has 4 elements, new values have 2 elements 

After this PR

>>> psdf[['a', 'a']]
   a  a
0  1  1
1  2  2
2  3  3
3  4  4

>>> psdf.loc[:, ['a', 'a']]
   a  a
0  1  1
1  2  2
2  3  3
3  4  4

How was this patch tested?

unit test

@dchvn
Contributor Author

dchvn commented Jan 18, 2022

CC @Yikun @HyukjinKwon @ueshin @itholic Please take a look when you find some time, thanks!


# SPARK-37930: Fix DataFrame select subset with duplicated columns
# Remove duplicated columns before select `data_columns`
pdf = pdf.loc[:, ~pdf.columns.duplicated()][data_columns]
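The one-line fix above can be reproduced in plain pandas. A minimal sketch (the `pdf` and `data_columns` values here are toy stand-ins for the internal frame, not the actual pyspark.pandas internals):

```python
import pandas as pd

# Toy stand-in for the internal pandas frame; the label "a" appears twice.
pdf = pd.DataFrame([[1, 1, 3], [2, 2, 4]], columns=["a", "a", "b"])
data_columns = ["a", "b"]

# Keep only the first occurrence of each column label, then select the
# requested columns, matching the suggested change above.
deduped = pdf.loc[:, ~pdf.columns.duplicated()][data_columns]
print(list(deduped.columns))  # ['a', 'b']
```

`pdf.columns.duplicated()` marks every repeat of a label after its first occurrence, so negating it keeps exactly one column per label before the subset selection.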
Member

Ohh, sorry, I missed https://spark.apache.org/docs/latest/api/python/user_guide/pandas_on_spark/best_practices.html?highlight=duplicate#do-not-use-duplicated-column-names. Actually, pandas API on Spark does not officially support duplicated column names for now.

But I think we can fix this, as the fix seems minimal. BTW, can we do pdf[~pdf.columns.duplicated()][data_columns]?

Contributor Author

can we do pdf[~pdf.columns.duplicated()][data_columns]?

I think pdf[~pdf.columns.duplicated()] is used to select rows. Do you mind if I keep using .loc?
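A minimal pandas sketch of the distinction being discussed here, on a toy frame (the data is illustrative only):

```python
import pandas as pd

pdf = pd.DataFrame({"a": [1, 2, 3], "b": [4, 5, 6]})

# A boolean mask passed as pdf[mask] filters ROWS (mask length must match
# the number of rows), so it cannot drop duplicated column labels.
rows = pdf[[True, False, True]]

# A boolean mask passed to pdf.loc[:, mask] selects COLUMNS (mask length
# must match the number of columns), which is what the fix needs.
cols = pdf.loc[:, [True, False]]

print(rows.shape)            # (2, 2)
print(list(cols.columns))    # ['a']
```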

Contributor Author

Should we change the docs in this PR?

Member

@Yikun Yikun Jan 19, 2022

Actually pandas API on Spark does not officially support duplicated column names for now.

learned

Should we change the docs in this PR

        import pyspark.pandas as ps
        psdf = ps.DataFrame({'a': [1, 2], 'b': [3, 4]})
        psdf.columns = ["a", "a"]
        print(psdf)

FYI, this still raises the exception pyspark.sql.utils.AnalysisException: Reference 'a' is ambiguous, could be: a, a.. So it seems the documentation is still valid.

And sorry to say this late, but I think this PR only covers the self-combine case, right? So, since we have made it clear in the documentation that duplicated columns are not supported, another option for the original PR is to throw an exception directly on self-combine. Let's see what others think.
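A rough sketch of what "throw an exception directly" could look like. Note that `check_no_duplicate_columns` is a hypothetical helper for illustration, not the actual pyspark.pandas code path:

```python
import pandas as pd

def check_no_duplicate_columns(key):
    """Hypothetical helper: raise an explicit, readable error when the
    same column label is requested more than once (a self-combine)."""
    idx = pd.Index(key)
    if idx.duplicated().any():
        dupes = sorted(set(idx[idx.duplicated()]))
        raise ValueError(f"Duplicated column names are not supported: {dupes}")

check_no_duplicate_columns(["a", "b"])  # fine, no duplicates

try:
    check_no_duplicate_columns(["a", "a"])
except ValueError as e:
    caught = str(e)
```

This would trade pandas compatibility for a clear error message, consistent with the best-practices documentation linked above.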

Contributor Author

@Yikun Thanks for the clarification. I prefer changing the behavior to follow pandas, but I do not mind if we decide to keep this difference.

Member

Maybe throwing an exception is an idea too. Actually, we have a lot of places like this, not only here, so let's just fix this first and revisit duplicated names separately later.

Member

@Yikun Yikun left a comment

Thanks for the fix, LGTM

@phofl

phofl commented Jan 20, 2022

Just as a note from the pandas side: the behavior as it is in pandas is correct. Selecting duplicated columns twice should duplicate them again, the same as when selecting a unique column twice.
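A quick pandas illustration of this point (toy data, not from the PR):

```python
import pandas as pd

pdf = pd.DataFrame({"a": [1, 2]})

# Selecting a unique column twice yields two copies of it...
twice = pdf[["a", "a"]]

# ...and selecting the duplicated label twice doubles the count again,
# because each "a" in the key matches every existing "a" column.
four = twice[["a", "a"]]

print(twice.shape)  # (2, 2)
print(four.shape)   # (2, 4)
```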

@itholic
Contributor

itholic commented Jan 21, 2022

I'm just worried that unexpected errors may occur on the resulting DataFrame after creating duplicated columns, since we do not officially support it.

For example,

>>> psdf[['a', 'a', 'a']].a
Traceback (most recent call last):
...
TypeError: to_string() got an unexpected keyword argument 'name'

whereas pandas supports it:

>>> pdf[['a', 'a', 'a']].a
          a  a  a
0.480706  1  1  1
0.670385  2  2  2
0.051040  3  3  3
0.815093  4  4  4
0.567992  5  5  5
0.740033  6  6  6
0.646050  7  7  7
0.058224  8  8  8
0.939080  9  9  9

So, how about we just raise a proper exception for now, as mentioned at #35240 (comment)?

@dchvn
Contributor Author

dchvn commented Jan 21, 2022

@itholic Thanks for your investigation. I will raise an exception to continue my original PR.
I have tested the case you mentioned and added a new test to cover it.

@dchvn dchvn marked this pull request as draft January 21, 2022 06:02
@dchvn dchvn closed this Apr 12, 2022