Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[SPARK-43926][CONNECT][PYTHON] Add array_agg, array_size, cardinality, count_min_sketch,mask,named_struct,json_* to Scala and Python #41718

Closed
wants to merge 10 commits into from

Conversation

ivoson
Copy link
Contributor

@ivoson ivoson commented Jun 25, 2023

What changes were proposed in this pull request?

Add following functions:

  • array_agg
  • array_size
  • cardinality
  • count_min_sketch
  • named_struct
  • json_array_length
  • json_object_keys
  • mask

To:

  • Scala API
  • Python API
  • Spark Connect Scala Client
  • Spark Connect Python Client

Why are the changes needed?

Add Scala, Python and Connect API for these sql functions: array_agg, array_size, cardinality, count_min_sketch, named_struct, json_array_length, json_object_keys, mask

Does this PR introduce any user-facing change?

Yes, added new functions.

How was this patch tested?

New UT added.

@ivoson
Copy link
Contributor Author

ivoson commented Jun 25, 2023

cc @zhengruifeng

@LuciferYang
Copy link
Contributor

also cc @HyukjinKwon @panbingkun @beliefer FYI

upperChar: Column,
lowerChar: Column,
digitChar: Column,
otherChar: Column): Column = {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please supplement the API with other constructor of Mask.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sure, will add the other constuctors. Thanks.

@@ -368,6 +368,24 @@ object functions {
*/
def collect_set(columnName: String): Column = collect_set(Column(columnName))

/**
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Returns a count-min sketch of a column with the given esp,
confidence and seed. The result is an array of bytes, which can be deserialized to a
CountMinSketch before usage. Count-min sketch is a probabilistic data structure used for
cardinality estimation using sub-linear space.

Is the comment above more aligned?

_FUNC_(col, eps, confidence, seed) - Returns a count-min sketch of a column with the given esp,
confidence and seed. The result is an array of bytes, which can be deserialized to a
`CountMinSketch` before usage. Count-min sketch is a probabilistic data structure used for
cardinality estimation using sub-linear space.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks, done.

@@ -14394,6 +14394,260 @@ def nvl2(col1: "ColumnOrName", col2: "ColumnOrName", col3: "ColumnOrName") -> Co
return _invoke_function_over_columns("nvl2", col1, col2, col3)


@try_remote_functions
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggest placing the function in the appropriate location of the corresponding file

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

relocate the functions by group, thanks.

@@ -3603,6 +3603,80 @@ def nvl2(col1: "ColumnOrName", col2: "ColumnOrName", col3: "ColumnOrName") -> Co
nvl2.__doc__ = pysparkfuncs.nvl2.__doc__


Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ditto

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

thanks, done

@@ -5537,6 +5568,61 @@ class DataFrameFunctionsSuite extends QueryTest with SharedSparkSession {
)
}

test("json_array_length function") {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can these tests be placed in JsonFunctionsSuite?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

thanks, moved to JsonFunctionsSuite

// scalastyle:off line.size.limit
/**
* Masks the given string value. The function replaces upper-case, lower-case characters
* with specific characters, and numbers with 'n'.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

replaces upper-case and lower-case characters with the characters specified respectively.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks.

// scalastyle:off line.size.limit
/**
* Masks the given string value. The function replaces upper-case, lower-case characters and
* numbers with specific characters.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ditto.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks, done.

_lowerChar = lit("x") if lowerChar is None else lowerChar
_digitChar = lit("n") if digitChar is None else digitChar
_otherChar = lit(None) if otherChar is None else otherChar
return _invoke_function_over_columns(
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If user passed lowerChar without upperChar, what will happens?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In this case, will repace lowerChar with specified character, and other cases with default value.

For example,

mask("col", lowerChar = 'y')

will replace:
lowerChar -> 'y'
upperChar -> 'X'
digitChar -> 'n'
otherChar -> retain original

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I mean is mask("col", 'y'), then user will be confused.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

According to the discussion between @zhengruifeng and me, the analyzer will check it too. You can temporarily ignore the above comment now.

*/
def mask(input: Column): Column = Column.fn("mask", input)

// scalastyle:off line.size.limit
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think we should fix the code style and remove these scalastyle:off line.size.limit

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

+1, please remove them,

I know there are some usage of scalastyle:off line.size.limit in this file, but they are due to long hyperlink string which can not be broken into multi line.

In this PR, i think it is not needed

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

new Mask(input.expr)
}

// scalastyle:off line.size.limit
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ditto, we should avoid to use scalastyle:off for the new code

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

got it. thanks.

@zhengruifeng
Copy link
Contributor

thanks, merged to master

@ivoson
Copy link
Contributor Author

ivoson commented Jun 30, 2023

thanks a lot for your review. @zhengruifeng @LuciferYang @beliefer @panbingkun

@ivoson ivoson deleted the SPARK-43926 branch June 30, 2023 09:16
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
5 participants