[SPARK-48482][PYTHON][FOLLOWUP] dropDuplicates and dropDuplicatesWIthinWatermark should accept named parameter #47835

Closed

WweiL wants to merge 7 commits into apache:master from WweiL:dropDuplicates-followup

Conversation

WweiL (Contributor) commented Aug 21, 2024

What changes were proposed in this pull request?

560c083 unintentionally broke dropDuplicates(subset=["col"]); this patch fixes that scenario.
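
For context, a minimal sketch of the regression (hypothetical data and column names, for illustration only):

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
df = spark.createDataFrame([("a", 1), ("a", 1), ("b", 2)], ["col", "n"])

df.dropDuplicates(["col"]).show()         # positional form: unaffected
df.dropDuplicates(subset=["col"]).show()  # keyword form: broken by 560c083, restored here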

Why are the changes needed?

Bug fix

Does this PR introduce any user-facing change?

No

How was this patch tested?

Unit test

Was this patch authored or co-authored using generative AI tooling?

No

WweiL (Contributor, Author) commented Aug 21, 2024

cc @HyukjinKwon @allisonwang-db @itholic PTAL!

errorClass="NOT_STR",
messageParameters={"arg_name": "subset", "arg_type": "NoneType"},
)

WweiL (Contributor, Author):

@itholic Please let me know if deleting this test case sounds good to you...

The way I made the named parameter work is to redefine the signature as:

def dropDuplicates(
    self, subset: Optional[Union[str, List[str]]] = None, *subset_varargs: str
) -> ParentDataFrame:

But this means that when "subset" is None, it can mean two things:

  1. dropDuplicates(None)
  2. dropDuplicates()

With my change, it looks like it's not possible to distinguish these two...

Member:

Let's add it to the migration guide python/docs/source/migration_guide/pyspark_upgrade.rst. Also you can leverage _NoValue instance to distinguish None from not setting a value.
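
For readers unfamiliar with the sentinel pattern, here is a minimal standalone sketch of the idea; PySpark's internal _NoValue works along these lines, but this is an illustration, not its actual implementation:

from typing import List, Optional, Union


class _NoValueType:
    """Sentinel type marking 'argument was not passed at all'."""

    def __repr__(self) -> str:
        return "<no value>"


_NoValue = _NoValueType()


def dropDuplicates(
    subset: Optional[Union[str, List[str], _NoValueType]] = _NoValue,
) -> str:
    if subset is _NoValue:
        return "called as dropDuplicates()"      # argument omitted entirely
    if subset is None:
        return "called as dropDuplicates(None)"  # None passed explicitly; can now be rejected
    return f"called with subset={subset!r}"


print(dropDuplicates())      # called as dropDuplicates()
print(dropDuplicates(None))  # called as dropDuplicates(None)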

WweiL (Contributor, Author):

Thank you! Changed to use _NoValue here

WweiL requested a review from HyukjinKwon, August 22, 2024 01:05
allisonwang-db (Contributor) left a comment:

dropDuplicates is a very widely used API and we need to be very careful here. Can we add more tests to see if it introduces any behavior changes?

# Parameters passed in as varargs
# (e.g. dropDuplicates("col"), dropDuplicates("col1", "col2"), ...)
elif isinstance(subset, str):
    item = [subset] + list(subset_varargs)
Contributor:

Can we also add some tests where subset and subset_varargs have invalid values?

WweiL (Contributor, Author):

Ah sure, let me add them.
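
A sketch of the kind of negative tests being asked for; the error type and which invalid combinations are rejected are assumptions for illustration, not the PR's actual test code:

from pyspark.errors import PySparkTypeError
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
df = spark.createDataFrame([("Alice", 5)], ["name", "age"])


def expect_type_error(fn) -> None:
    # Helper: assert that calling fn raises PySparkTypeError.
    try:
        fn()
    except PySparkTypeError:
        return
    raise AssertionError("expected PySparkTypeError")


expect_type_error(lambda: df.dropDuplicates(123))        # subset is neither str nor list
expect_type_error(lambda: df.dropDuplicates("name", 42)) # non-str value among the varargs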

Comment on lines +4573 to +4575
def dropDuplicates(
    self, subset: Optional[Union[str, List[str], _NoValueType]] = _NoValue, *subset_varargs: str
) -> "DataFrame":
Contributor:

Can we also update the docstring here for subset and add subset_varargs? Also add more examples?

WweiL (Contributor, Author), Aug 22, 2024:

Ah sure, let me add them. Thanks for the suggestion!
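
A sketch (not the PR's exact code) of how the method body could dispatch over the three call shapes the new signature allows; _drop_duplicates is a hypothetical helper standing in for the real plan-building step:

def dropDuplicates(self, subset=_NoValue, *subset_varargs):
    if subset is _NoValue:
        # dropDuplicates() -- deduplicate on all columns
        cols = None
    elif isinstance(subset, str):
        # dropDuplicates("c1", "c2", ...) -- varargs form
        cols = [subset] + list(subset_varargs)
    elif isinstance(subset, (list, tuple)):
        # dropDuplicates(["c1", "c2"]) or dropDuplicates(subset=["c1", "c2"])
        cols = list(subset)
    else:
        raise TypeError(
            f"subset should be a str or a list of str, got {type(subset).__name__}"
        )
    return self._drop_duplicates(cols)  # hypothetical helper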

Comment on lines +666 to +667
df = self.connect.read.table(self.tbl_name2)
df2 = self.spark.read.table(self.tbl_name2)
Contributor:

why change the table here?

WweiL (Contributor, Author):

It's because tbl_name1 only has two fields, and I want to test three fields.

WweiL requested a review from allisonwang-db, August 22, 2024 20:31
subset : list of column names, optional
    List of columns to use for duplicate comparison (default: all columns).

subset_varargs : optional
    Additional column names, used to support variable-length arguments.
Contributor:

nit: let's also add the supported version for this parameter

Comment on lines +4635 to +4642
Deduplicate values on 'name' and 'height' columns.

>>> df.dropDuplicates(subset=['name', 'height']).show()
+-----+---+------+
| name|age|height|
+-----+---+------+
|Alice|  5|    80|
+-----+---+------+
Contributor:

Hmm, this example is exactly the same as the one above. It can be confusing to users whether they should use dropDuplicates(subset=['name', 'height']) or use dropDuplicates('name', 'height') directly.


Deduplicate values on the 'value' column.

>>> df.dropDuplicatesWithinWatermark(subset=['value']) # doctest: +SKIP
Contributor:

ditto

allisonwang-db (Contributor):

My general comment here is that we should be opinionated and only have one way to perform certain operations. After this change, users now have two identical ways to drop duplicates:

  1. dropDuplicates("c1", "c2")
  2. dropDuplicates(["c1", "c2"])

Which one should users choose?

P.S. Pandas / Pandas on Spark uses the subset=["c1", "c2"] pattern (see: https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.drop_duplicates.html and https://spark.apache.org/docs/latest/api/python/reference/pyspark.pandas/api/pyspark.pandas.DataFrame.drop_duplicates.html). It could be confusing to make the PySpark API different.
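
For comparison, the pandas idiom the comment points to:

import pandas as pd

pdf = pd.DataFrame({"c1": ["a", "a", "b"], "c2": [1, 1, 2]})

# pandas exposes only the keyword/list form; there is no varargs variant.
print(pdf.drop_duplicates(subset=["c1", "c2"]))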

WweiL closed this Aug 28, 2024
HyukjinKwon pushed a commit that referenced this pull request Aug 30, 2024
…tesWIthinWatermark should accept variable length args

### What changes were proposed in this pull request?

Per the conversation in #47835 (comment), we will revert 560c083 for API parity with the Pandas API.

### Why are the changes needed?

Bug fix

### Does this PR introduce _any_ user-facing change?

Yes, reverting the API would re-enable users to use `dropDuplicates(subset=xxx)`

### How was this patch tested?

Unit test

### Was this patch authored or co-authored using generative AI tooling?

No

Closes #47916 from WweiL/revert-dropDuplicates-api.

Authored-by: Wei Liu <wei.liu@databricks.com>
Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>