New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[SPARK-33124][SQL] Fills missing group tags and re-categorizes all the group tags for built-in functions #30867
Conversation
cc: @HyukjinKwon @tanelk |
This PR focuses on categorizing the existing built-in funcs. In a follow-p PR, I'll fix the doc-related issue. |
Kubernetes integration test starting |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is this for Apache Spark 3.2.0, @maropu ?
Yea, I think so, @dongjoon-hyun cc: @HyukjinKwon |
Kubernetes integration test status success |
...catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/collectionOperations.scala
Show resolved
Hide resolved
...catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/collectionOperations.scala
Show resolved
Hide resolved
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/complexTypeCreator.scala
Show resolved
Hide resolved
sql/core/src/test/scala/org/apache/spark/sql/expressions/ExpressionInfoSuite.scala
Show resolved
Hide resolved
@@ -205,7 +207,8 @@ case class SchemaOfCsv( | |||
> SELECT _FUNC_(named_struct('time', to_timestamp('2015-08-26', 'yyyy-MM-dd')), map('timestampFormat', 'dd/MM/yyyy')); | |||
26/08/2015 | |||
""", | |||
since = "3.0.0") | |||
since = "3.0.0", | |||
group = "csv_funcs") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
In this case, Is this able to be struct_funcs
because the input type is StructType
, too?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thank you, @maropu . Mostly looks good. It seems that we need a few clarification.
- The category is based on the
outputType
orinputType
? Some functions are at the intersection of both types. e.g.StructToCsv
orCreateArray
. - The definition of
array_func
andcollection_func
? - Some function category is changed from Apache Spark 3.1.0. So, this is not
Adds a group tag
. Instead, this is re-categorization.
It would be enough if you clarify the design at the beginning of the PR description.
Test build #133117 has finished for PR 30867 at commit
|
Thanks for the comment, @dongjoon-hyun !
A basic policy to re-categorize functions is that functions in the same file are categorized into the same group. But, yea, the two cases you pointed out above are ambiguous cases, I think. In the current approach,
Anyway, this is a first-shot to re-categorize them, so I'm open to other ideas. |
I've updated the title and the description based on the comment above. |
Test build #133133 has finished for PR 30867 at commit
|
Kubernetes integration test starting |
Kubernetes integration test status failure |
Nice! |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
+1, LGTM for Apache Spark 3.2.0.
Thank you, @maropu and @HyukjinKwon !
Merged to master! |
Thanks, @dongjoon-hyun and @HyukjinKwon ! |
The enforcing is good to have at the early stage of Apache Spark 3.2.0 by preventing the increase of new expressions. |
What changes were proposed in this pull request?
This PR proposes to fill missing group tags and re-categorize all the group tags for built-in functions.
New groups below are added in this PR:
A basic policy to re-categorize functions is that functions in the same file are categorized into the same group. For example, all the functions in
hash.scala
are categorized intohash_funcs
. But, there are some exceptional/ambiguous cases when categorizing them. Here are some special notes:agg_funcs
.array_funcs
andmap_funcs
are sub-groups ofcollection_funcs
. For example,array_contains
is used only for arrays, so it is assigned toarray_funcs
. On the other hand,reverse
is used for both arrays and strings, so it is assigned tocollection_funcs
.schema_of_csv
can be grouped into bothcsv_funcs
andstruct_funcs
in terms of input types, but it is assigned tocsv_funcs
because it belongs to thecsvExpressions.scala
file that holds the other CSV-related functions.nullExpressions.scala
,complexTypeCreator.scala
,randomExpressions.scala
, andregexExpressions.scala
are categorized based on their functionalities. For example:isnull
innullExpressions
is assigned topredicate_funcs
because this is a predicate function.array
incomplexTypeCreator.scala
is assigned toarray_funcs
based on its output type (The other functions inarray_funcs
are categorized based on their input types though).A category list (after this PR) is as follows (the list below includes the exprs that already have a group tag in the current master):
Closes #30040
NOTE: An original author of this PR is @tanelk, so the credit should be given to @tanelk.
Why are the changes needed?
For better documents.
Does this PR introduce any user-facing change?
No.
How was this patch tested?
Add a test to check if exprs have a group tag in
ExpressionInfoSuite
.