New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[SPARK-43926][CONNECT][PYTHON] Add array_agg, array_size, cardinality, count_min_sketch,mask,named_struct,json_* to Scala and Python #41718
Conversation
also cc @HyukjinKwon @panbingkun @beliefer FYI |
upperChar: Column, | ||
lowerChar: Column, | ||
digitChar: Column, | ||
otherChar: Column): Column = { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Please supplement the API with other constructor of Mask
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sure, will add the other constuctors. Thanks.
@@ -368,6 +368,24 @@ object functions { | |||
*/ | |||
def collect_set(columnName: String): Column = collect_set(Column(columnName)) | |||
|
|||
/** |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Returns a count-min sketch of a column with the given esp,
confidence and seed. The result is an array of bytes, which can be deserialized to a
CountMinSketch
before usage. Count-min sketch is a probabilistic data structure used for
cardinality estimation using sub-linear space.
Is the comment above more aligned?
Lines 45 to 48 in 00e7c08
_FUNC_(col, eps, confidence, seed) - Returns a count-min sketch of a column with the given esp, | |
confidence and seed. The result is an array of bytes, which can be deserialized to a | |
`CountMinSketch` before usage. Count-min sketch is a probabilistic data structure used for | |
cardinality estimation using sub-linear space. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks, done.
python/pyspark/sql/functions.py
Outdated
@@ -14394,6 +14394,260 @@ def nvl2(col1: "ColumnOrName", col2: "ColumnOrName", col3: "ColumnOrName") -> Co | |||
return _invoke_function_over_columns("nvl2", col1, col2, col3) | |||
|
|||
|
|||
@try_remote_functions |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Suggest placing the function in the appropriate location of the corresponding file
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
relocate the functions by group, thanks.
@@ -3603,6 +3603,80 @@ def nvl2(col1: "ColumnOrName", col2: "ColumnOrName", col3: "ColumnOrName") -> Co | |||
nvl2.__doc__ = pysparkfuncs.nvl2.__doc__ | |||
|
|||
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
ditto
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
thanks, done
@@ -5537,6 +5568,61 @@ class DataFrameFunctionsSuite extends QueryTest with SharedSparkSession { | |||
) | |||
} | |||
|
|||
test("json_array_length function") { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can these tests be placed in JsonFunctionsSuite
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
thanks, moved to JsonFunctionsSuite
// scalastyle:off line.size.limit | ||
/** | ||
* Masks the given string value. The function replaces upper-case, lower-case characters | ||
* with specific characters, and numbers with 'n'. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
replaces upper-case and lower-case characters with the characters specified respectively.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks.
// scalastyle:off line.size.limit | ||
/** | ||
* Masks the given string value. The function replaces upper-case, lower-case characters and | ||
* numbers with specific characters. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
ditto.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks, done.
python/pyspark/sql/functions.py
Outdated
_lowerChar = lit("x") if lowerChar is None else lowerChar | ||
_digitChar = lit("n") if digitChar is None else digitChar | ||
_otherChar = lit(None) if otherChar is None else otherChar | ||
return _invoke_function_over_columns( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If user passed lowerChar without upperChar, what will happens?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
In this case, will repace lowerChar with specified character, and other cases with default value.
For example,
mask("col", lowerChar = 'y')
will replace:
lowerChar -> 'y'
upperChar -> 'X'
digitChar -> 'n'
otherChar -> retain original
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I mean is mask("col", 'y')
, then user will be confused.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
According to the discussion between @zhengruifeng and me, the analyzer will check it too. You can temporarily ignore the above comment now.
*/ | ||
def mask(input: Column): Column = Column.fn("mask", input) | ||
|
||
// scalastyle:off line.size.limit |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think we should fix the code style and remove these scalastyle:off line.size.limit
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
+1, please remove them,
I know there are some usage of scalastyle:off line.size.limit
in this file, but they are due to long hyperlink string which can not be broken into multi line.
In this PR, i think it is not needed
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done. thanks @LuciferYang @zhengruifeng
new Mask(input.expr) | ||
} | ||
|
||
// scalastyle:off line.size.limit |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
ditto, we should avoid to use scalastyle:off
for the new code
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
got it. thanks.
thanks, merged to master |
thanks a lot for your review. @zhengruifeng @LuciferYang @beliefer @panbingkun |
What changes were proposed in this pull request?
Add following functions:
To:
Why are the changes needed?
Add Scala, Python and Connect API for these sql functions: array_agg, array_size, cardinality, count_min_sketch, named_struct, json_array_length, json_object_keys, mask
Does this PR introduce any user-facing change?
Yes, added new functions.
How was this patch tested?
New UT added.