[SPARK-37930][PYTHON] Fix DataFrame select subset with duplicated columns #35240
dchvn wants to merge 3 commits into apache:master
Conversation
CC @Yikun @HyukjinKwon @ueshin @itholic Please take a look when you find some time, thanks!
```python
# SPARK-37930: Fix DataFrame select subset with duplicated columns
# Remove duplicated columns before selecting `data_columns`
pdf = pdf.loc[:, ~pdf.columns.duplicated()][data_columns]
```
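A plain-pandas sketch of what the patched line does, with a tiny stand-in DataFrame (the `pdf` and `data_columns` names follow the snippet above; the sample data is made up for illustration):

```python
import pandas as pd

# Build a DataFrame with a duplicated column label "a".
pdf = pd.DataFrame([[1, 3, 5], [2, 4, 6]])
pdf.columns = ["a", "a", "b"]

# ~pdf.columns.duplicated() keeps only the first occurrence of each label,
# so the subsequent label-based select no longer hits the ambiguity.
deduped = pdf.loc[:, ~pdf.columns.duplicated()]
print(list(deduped.columns))  # ['a', 'b']

data_columns = ["a", "b"]
subset = deduped[data_columns]
print(subset.shape)  # (2, 2)
```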
ohh, sorry, I missed https://spark.apache.org/docs/latest/api/python/user_guide/pandas_on_spark/best_practices.html?highlight=duplicate#do-not-use-duplicated-column-names. Actually, pandas API on Spark does not officially support duplicated column names for now.
But I think we can fix this since the fix seems minimal. BTW, can we do `pdf[~pdf.columns.duplicated()][data_columns]`?
> can we do `pdf[~pdf.columns.duplicated()][data_columns]`?

I think `pdf[~pdf.columns.duplicated()]` is used to select rows. Do you mind if I keep using `.loc`?
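A quick plain-pandas check of this point (sample data is made up for illustration): plain `[]` with a boolean array filters rows, so a column-length mask is the wrong shape, while `.loc[:, mask]` selects along the column axis.

```python
import pandas as pd

pdf = pd.DataFrame({"x": [1, 2], "y": [3, 4], "z": [5, 6]})
pdf.columns = ["a", "a", "b"]

mask = ~pdf.columns.duplicated()  # length 3 == number of columns

try:
    pdf[mask]  # treated as a row filter; mask length 3 != 2 rows
except ValueError as e:
    print("row-wise indexing failed:", e)

# .loc[:, mask] selects columns, which is what the fix needs
print(list(pdf.loc[:, mask].columns))  # ['a', 'b']
```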
> Actually pandas API on Spark does not officially support duplicated column names for now.

Learned. Should we change the docs in this PR?
```python
import pyspark.pandas as ps

psdf = ps.DataFrame({'a': [1, 2], 'b': [3, 4]})
psdf.columns = ["a", "a"]
print(psdf)
```
FYI, this still raises the exception `pyspark.sql.utils.AnalysisException: Reference 'a' is ambiguous, could be: a, a.`. So it seems that the document is still valid.
And sorry to say this late, but I think this PR is only for the self-combine case, right? So another option for the original PR: since we have made it clear in the document that duplicated column names are not supported, should we directly throw an exception on self combine? Let's see other ideas.
@Yikun Thanks for the clarification. I prefer changing the behavior to follow pandas, but I don't mind if we decide to keep this difference.
Maybe throwing an exception is an idea too. Actually, we have a lot of places like this, not only here, so let's just fix this first and revisit the duplicated names later separately.
Just as a note from the pandas side: the behavior as-is in pandas is correct. Selecting duplicated columns twice should duplicate them again. This is the same as when selecting a unique column twice.
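A plain-pandas illustration of the note above (sample data made up): re-selecting a label duplicates columns again, whether or not the label was unique to begin with.

```python
import pandas as pd

pdf = pd.DataFrame({"a": [1, 2], "b": [3, 4]})

# Selecting a unique column twice yields two "a" columns.
print(pdf[["a", "a"]].shape)  # (2, 2)

dup = pdf[["a", "a"]]          # columns: a, a
# Selecting the duplicated label again doubles it: each "a" in the
# selection list matches both existing "a" columns, giving 4 columns.
print(dup[["a", "a"]].shape)  # (2, 4)
```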
I'm just worried that unexpected errors may occur on the resulting DataFrame after creating duplicated columns, since we don't officially support it. For example:

```python
>>> psdf[['a', 'a', 'a']].a
Traceback (most recent call last):
...
TypeError: to_string() got an unexpected keyword argument 'name'
```

whereas pandas supports it:

```python
>>> pdf[['a', 'a', 'a']].a
          a  a  a
0.480706  1  1  1
0.670385  2  2  2
0.051040  3  3  3
0.815093  4  4  4
0.567992  5  5  5
0.740033  6  6  6
0.646050  7  7  7
0.058224  8  8  8
0.939080  9  9  9
```

So, how about we just raise a proper exception for now, as mentioned at #35240 (comment)?
@itholic Thanks for your investigation. I will raise an exception, continuing from my original PR.
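A minimal sketch of the raise-an-exception option agreed on above; `check_duplicated_columns` is a hypothetical helper for illustration, not the actual code added to Spark.

```python
import pandas as pd

def check_duplicated_columns(columns):
    # Hypothetical guard: reject duplicated column labels up front
    # instead of failing later with an ambiguous-reference error.
    index = pd.Index(columns)
    dup_mask = index.duplicated()
    if dup_mask.any():
        dups = sorted(set(index[dup_mask]))
        raise ValueError("Duplicated column names are not supported: %s" % dups)

check_duplicated_columns(["a", "b"])  # passes silently

try:
    check_duplicated_columns(["a", "a", "b"])
except ValueError as e:
    print(e)  # names the duplicated label "a"
```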
What changes were proposed in this pull request?
Fix selecting a subset with duplicated columns of a ps.DataFrame.
Why are the changes needed?
Currently, when selecting a subset with duplicated columns of a ps.DataFrame, we face an exception.
We should fix it and follow pandas's behavior.
Does this PR introduce any user-facing change?
Yes,
Before this PR
After this PR
How was this patch tested?
Unit tests.