[SPARK-48482][PYTHON][FOLLOWUP] dropDuplicates and dropDuplicatesWIthinWatermark should accept named parameter #47835

Closed

WweiL wants to merge 7 commits into apache:master from WweiL:dropDuplicates-followup

Conversation

WweiL (Contributor) commented Aug 21, 2024

What changes were proposed in this pull request?

560c083 unintentionally broke dropDuplicates(subset=["col"]); this patch fixes that scenario.
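
For context, a minimal sketch of the regression (hypothetical data and column names, for illustration only):

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
df = spark.createDataFrame([("a", 1), ("a", 1), ("b", 2)], ["col", "n"])

df.dropDuplicates(["col"]).show()         # positional form: unaffected
df.dropDuplicates(subset=["col"]).show()  # keyword form: broken by 560c083, restored here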

Why are the changes needed?

Bug fix

Does this PR introduce any user-facing change?

No

How was this patch tested?

Unit test

Was this patch authored or co-authored using generative AI tooling?

No

WweiL (Contributor, Author) commented Aug 21, 2024

cc @HyukjinKwon @allisonwang-db @itholic PTAL!

errorClass="NOT_STR",
messageParameters={"arg_name": "subset", "arg_type": "NoneType"},
)

WweiL (Contributor, Author):

@itholic Please let me know if deleting this test case sounds good to you...

The way I made the named parameter work is to redefine the signature as:

def dropDuplicates(
    self, subset: Optional[Union[str, List[str]]] = None, *subset_varargs: str
) -> ParentDataFrame:

But this means that when "subset" is None, it can mean two things:

  1. dropDuplicates(None)
  2. dropDuplicates()

With my change, it looks like it's not possible to distinguish these two...

Member:

Let's add it to the migration guide python/docs/source/migration_guide/pyspark_upgrade.rst. Also you can leverage _NoValue instance to distinguish None from not setting a value.
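
For readers unfamiliar with the sentinel pattern, here is a minimal standalone sketch of the idea; PySpark's internal _NoValue works along these lines, but this is an illustration, not its actual implementation:

from typing import List, Optional, Union


class _NoValueType:
    """Sentinel type marking 'argument was not passed at all'."""

    def __repr__(self) -> str:
        return "<no value>"


_NoValue = _NoValueType()


def dropDuplicates(
    subset: Optional[Union[str, List[str], _NoValueType]] = _NoValue,
) -> str:
    if subset is _NoValue:
        return "called as dropDuplicates()"      # argument omitted entirely
    if subset is None:
        return "called as dropDuplicates(None)"  # None passed explicitly; can now be rejected
    return f"called with subset={subset!r}"


print(dropDuplicates())      # called as dropDuplicates()
print(dropDuplicates(None))  # called as dropDuplicates(None)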

WweiL (Contributor, Author):

Thank you! Changed to use _NoValue here

WweiL requested a review from HyukjinKwon, August 22, 2024 01:05
allisonwang-db (Contributor) left a comment:

dropDuplicates is a very widely used API and we need to be very careful here. Can we add more tests to see if it introduces any behavior changes?

# Parameters passed in as varargs
# (e.g. dropDuplicates("col"), dropDuplicates("col1", "col2"), ...)
elif isinstance(subset, str):
    item = [subset] + list(subset_varargs)
Contributor:

Can we also add some tests where subset and subset_varargs have invalid values?

WweiL (Contributor, Author):

Ah sure, let me add them.
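
A sketch of the kind of negative tests being asked for; the error type and which invalid combinations are rejected are assumptions for illustration, not the PR's actual test code:

from pyspark.errors import PySparkTypeError
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
df = spark.createDataFrame([("Alice", 5)], ["name", "age"])


def expect_type_error(fn) -> None:
    # Helper: assert that calling fn raises PySparkTypeError.
    try:
        fn()
    except PySparkTypeError:
        return
    raise AssertionError("expected PySparkTypeError")


expect_type_error(lambda: df.dropDuplicates(123))        # subset is neither str nor list
expect_type_error(lambda: df.dropDuplicates("name", 42)) # non-str value among the varargs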

Comment on lines +4573 to +4575
def dropDuplicates(
    self, subset: Optional[Union[str, List[str], _NoValueType]] = _NoValue, *subset_varargs: str
) -> "DataFrame":
Contributor:

Can we also update the docstring here for subset and add subset_varargs? Also add more examples?

WweiL (Contributor, Author), Aug 22, 2024:

Ah sure, let me add them. Thanks for the suggestion!
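
A sketch (not the PR's exact code) of how the method body could dispatch over the three call shapes the new signature allows; _drop_duplicates is a hypothetical helper standing in for the real plan-building step:

def dropDuplicates(self, subset=_NoValue, *subset_varargs):
    if subset is _NoValue:
        # dropDuplicates() -- deduplicate on all columns
        cols = None
    elif isinstance(subset, str):
        # dropDuplicates("c1", "c2", ...) -- varargs form
        cols = [subset] + list(subset_varargs)
    elif isinstance(subset, (list, tuple)):
        # dropDuplicates(["c1", "c2"]) or dropDuplicates(subset=["c1", "c2"])
        cols = list(subset)
    else:
        raise TypeError(
            f"subset should be a str or a list of str, got {type(subset).__name__}"
        )
    return self._drop_duplicates(cols)  # hypothetical helper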

Comment on lines +666 to +667
df = self.connect.read.table(self.tbl_name2)
df2 = self.spark.read.table(self.tbl_name2)
Contributor:

why change the table here?

WweiL (Contributor, Author):

It's because tbl_name1 only has two fields, and I want to test three fields.

WweiL requested a review from allisonwang-db, August 22, 2024 20:31
subset : list of column names, optional
    List of columns to use for duplicate comparison (default: all columns).

subset_varargs : optional
    Additional column names, used to support variable-length arguments.
Contributor:

nit: let's also add the supported version for this parameter

Comment on lines +4635 to +4642
Deduplicate values on 'name' and 'height' columns.

>>> df.dropDuplicates(subset=['name', 'height']).show()
+-----+---+------+
| name|age|height|
+-----+---+------+
|Alice|  5|    80|
+-----+---+------+
Contributor:

Hmm, this example is exactly the same as the one above. It can be confusing to users whether they should use dropDuplicates(subset=['name', 'height']) or use dropDuplicates('name', 'height') directly.


Deduplicate values on the 'value' column.

>>> df.dropDuplicatesWithinWatermark(subset=['value']) # doctest: +SKIP
Contributor:

ditto

allisonwang-db (Contributor):

My general comment here is that we should be opinionated and only have one way to perform certain operations. After this change, users now have two identical ways to drop duplicates:

  1. dropDuplicates("c1", "c2")
  2. dropDuplicates(["c1", "c2"])

Which one should users choose?

P.S. Pandas / Pandas on Spark uses the subset=["c1", "c2"] pattern (see: https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.drop_duplicates.html and https://spark.apache.org/docs/latest/api/python/reference/pyspark.pandas/api/pyspark.pandas.DataFrame.drop_duplicates.html). It could be confusing to make the PySpark API different.
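
For comparison, the pandas idiom the comment points to:

import pandas as pd

pdf = pd.DataFrame({"c1": ["a", "a", "b"], "c2": [1, 1, 2]})

# pandas exposes only the keyword/list form; there is no varargs variant.
print(pdf.drop_duplicates(subset=["c1", "c2"]))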

WweiL closed this Aug 28, 2024
HyukjinKwon pushed a commit that referenced this pull request Aug 30, 2024
…tesWIthinWatermark should accept variable length args

### What changes were proposed in this pull request?

Per the conversation in #47835 (comment), we will revert 560c083 for API parity with the Pandas API.

### Why are the changes needed?

Bug fix

### Does this PR introduce _any_ user-facing change?

Yes, reverting the API would re-enable users to use `dropDuplicates(subset=xxx)`

### How was this patch tested?

Unit test

### Was this patch authored or co-authored using generative AI tooling?

No

Closes #47916 from WweiL/revert-dropDuplicates-api.

Authored-by: Wei Liu <wei.liu@databricks.com>
Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>