[SPARK-48482][PYTHON][FOLLOWUP] dropDuplicates and dropDuplicatesWithinWatermark should accept named parameter #47835
WweiL wants to merge 7 commits into apache:master
Conversation
cc @HyukjinKwon @allisonwang-db @itholic PTAL!
errorClass="NOT_STR",
messageParameters={"arg_name": "subset", "arg_type": "NoneType"},
)
|
|
@itholic Please let me know if deleting this test case sounds good to you...
The way I made named parameter work is to redefine the parameter as:
def dropDuplicates(
self, subset: Optional[Union[str, List[str]]] = None, *subset_varargs: str
) -> ParentDataFrame:
But this means that when "subset" is None, it can mean two things:
- dropDuplicates(None)
- dropDuplicates()
With my change it looks like it's not possible to distinguish these two...
Let's add it to the migration guide `python/docs/source/migration_guide/pyspark_upgrade.rst`. Also, you can leverage the `_NoValue` instance to distinguish an explicit None from not setting a value.
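The sentinel idea suggested here can be sketched in plain Python. This is a hedged stand-in, not PySpark's actual implementation: the names `_NoValueType`, `_NoValue`, and `drop_duplicates` mirror the PR discussion but are illustrative only.

```python
from typing import List, Union


class _NoValueType:
    """Singleton sentinel distinguishing 'argument not passed' from None."""

    _instance = None

    def __new__(cls) -> "_NoValueType":
        if cls._instance is None:
            cls._instance = super().__new__(cls)
        return cls._instance

    def __repr__(self) -> str:
        return "<no value>"


_NoValue = _NoValueType()


def drop_duplicates(
    subset: Union[str, List[str], None, _NoValueType] = _NoValue,
    *subset_varargs: str,
) -> List[str]:
    if subset is _NoValue:
        # drop_duplicates() with no arguments: fall back to all columns
        return []
    if subset is None:
        # An explicit drop_duplicates(None) can now be told apart and rejected
        raise TypeError("subset must be a str or list of str, got NoneType")
    if isinstance(subset, str):
        # Varargs style: drop_duplicates("col1", "col2", ...)
        return [subset, *subset_varargs]
    return list(subset)
```

With the sentinel as the default, `subset is _NoValue` means the caller passed nothing, while `subset is None` means the caller passed None explicitly, so the two cases conflated in the original signature become distinguishable.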
Thank you! Changed to use _NoValue here
allisonwang-db
left a comment
dropDuplicates is a very widely used API and we need to be very careful here. Can we add more tests to see if it introduces any behavior changes?
# Parameters passed in as varargs
# (e.g. dropDuplicates("col"), dropDuplicates("col1", "col2"), ...)
elif isinstance(subset, str):
    item = [subset] + list(subset_varargs)
Can we also add some tests where it (subset and subset_varargs) has invalid values?
ah sure let me add them
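A hedged sketch of the kind of negative tests being requested. The `normalize_subset` helper below is a hypothetical stand-in for the subset/varargs handling; the real tests would call `df.dropDuplicates` and assert on the raised error class.

```python
# Hypothetical stand-in for the subset/varargs normalization logic
def normalize_subset(subset, *subset_varargs):
    if isinstance(subset, str):
        for v in subset_varargs:
            if not isinstance(v, str):
                raise TypeError(f"varargs must be str, got {type(v).__name__}")
        return [subset, *subset_varargs]
    if isinstance(subset, list):
        if subset_varargs:
            raise TypeError("cannot mix a list subset with varargs")
        return list(subset)
    raise TypeError(f"subset must be str or list, got {type(subset).__name__}")


# Invalid combinations of subset and subset_varargs should all raise TypeError
for bad_args in [(1,), ("a", 2), (["a"], "b"), (None,)]:
    try:
        normalize_subset(*bad_args)
        raise AssertionError(f"expected TypeError for {bad_args!r}")
    except TypeError:
        pass
```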
def dropDuplicates(
    self, subset: Optional[Union[str, List[str], _NoValueType]] = _NoValue, *subset_varargs: str
) -> "DataFrame":
Can we also update the docstring here for subset and add subset_varargs? Also add more examples?
ah sure let me add them, thanks for the suggestion!
df = self.connect.read.table(self.tbl_name2)
df2 = self.spark.read.table(self.tbl_name2)
why change the table here?
It's because tbl_name1 only has two fields, and I want to test three fields
subset : list of column names, optional
    List of columns to use for duplicate comparison (default: all columns).
subset_varargs : optional arguments used for supporting variable-length arguments.
nit: let's also add the version this parameter was added in
Deduplicate values on 'name' and 'height' columns.

>>> df.dropDuplicates(subset=['name', 'height']).show()
+-----+---+------+
| name|age|height|
+-----+---+------+
|Alice|  5|    80|
+-----+---+------+
Hmm this example is exactly the same as the above one. It can be confusing to users whether they should use dropDuplicates(subset=['name', 'height']) or directly use dropDuplicates('name', 'height')
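A minimal sketch of why the two spellings look identical to users after this change. The `resolve_columns` function is a hypothetical normalizer, not the real implementation; it only illustrates that both call styles collapse to the same column list.

```python
def resolve_columns(subset=None, *subset_varargs):
    # Both resolve_columns('name', 'height') and
    # resolve_columns(subset=['name', 'height']) reduce to the same list.
    if isinstance(subset, str):
        return [subset, *subset_varargs]
    return list(subset) if subset is not None else []


assert resolve_columns("name", "height") == resolve_columns(subset=["name", "height"])
```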
Deduplicate values on 'value' columns.

>>> df.dropDuplicatesWithinWatermark(subset=['value'])  # doctest: +SKIP
My general comment here is that we should be opinionated and only have one way to perform certain operations. After this change, users now have two identical ways to drop duplicates:
Which one should users choose? P.S. Pandas / Pandas on Spark uses the `subset` keyword.
…tesWithinWatermark should accept variable length args

What changes were proposed in this pull request?
Per conversation from #47835 (comment), we will revert 560c083 for API parity with the Pandas API.

Why are the changes needed?
Bug fix

Does this PR introduce _any_ user-facing change?
Yes, reverting the API would re-enable users to use `dropDuplicates(subset=xxx)`

How was this patch tested?
Unit test

Was this patch authored or co-authored using generative AI tooling?
No

Closes #47916 from WweiL/revert-dropDuplicates-api.
Authored-by: Wei Liu <wei.liu@databricks.com>
Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>
What changes were proposed in this pull request?
560c083 unintentionally made `dropDuplicates(subset=["col"])` stop working; this patches that scenario.

Why are the changes needed?
Bug fix
Does this PR introduce any user-facing change?
No
How was this patch tested?
Unit test
Was this patch authored or co-authored using generative AI tooling?
No