
[SPARK-45232][SQL][DOCS] Add missing function groups to SQL references #43011

Closed
wants to merge 7 commits

Conversation

zhengruifeng
Contributor

@zhengruifeng zhengruifeng commented Sep 20, 2023

What changes were proposed in this pull request?

Add missing function groups to SQL references:

  • xml_funcs
  • lambda_funcs
  • collection_funcs
  • url_funcs
  • hash_funcs
  • struct_funcs

Note that this PR doesn't fix table_funcs:
1. gen-sql-functions-docs.py doesn't work properly with TableFunctionRegistry; I took a cursory look but failed to fix it;
2. table functions except range (e.g. explode) are already covered under Generator Functions, so I'm not sure we need to show them twice.
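For illustration, the change to the generator's group list can be sketched in Python (a hypothetical whitelist, not the actual gen-sql-functions-docs.py; only the groups visible in this PR's diff are listed, with earlier entries elided):

```python
# Sketch only: the doc generator keeps a whitelist of function groups to
# document; this PR extends it with the groups that were missing.
EXISTING_GROUPS = {
    # ... earlier entries elided ...
    "math_funcs", "conditional_funcs", "generator_funcs",
    "predicate_funcs", "string_funcs", "misc_funcs",
    "bitwise_funcs", "conversion_funcs", "csv_funcs",
}
NEW_GROUPS = {
    "xml_funcs", "lambda_funcs", "collection_funcs",
    "url_funcs", "hash_funcs", "struct_funcs",
}
DOCUMENTED_GROUPS = EXISTING_GROUPS | NEW_GROUPS

# The added groups were not documented before.
assert NEW_GROUPS.isdisjoint(EXISTING_GROUPS)
print(len(NEW_GROUPS))  # 6
```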

Why are the changes needed?

When referring to the SQL reference at https://spark.apache.org/docs/latest/sql-ref-functions.html, I found that many functions are missing.

Does this PR introduce any user-facing change?

Yes, new function groups are added to the SQL reference documentation.

How was this patch tested?

Manually checked the generated documentation.

Was this patch authored or co-authored using generative AI tooling?

No.

@zhengruifeng zhengruifeng changed the title [SPARK-45232][DOC] Add missing function groups to SQL references [WIP][SPARK-45232][DOC] Add missing function groups to SQL references Sep 20, 2023
```diff
@@ -34,6 +34,8 @@
     "math_funcs", "conditional_funcs", "generator_funcs",
     "predicate_funcs", "string_funcs", "misc_funcs",
     "bitwise_funcs", "conversion_funcs", "csv_funcs",
+    "xml_funcs", "lambda_funcs", "collection_funcs",
+    "url_funcs", "hash_funcs", "struct_funcs",
```
Contributor Author

Checked against:

```java
private static final Set<String> validGroups =
    new HashSet<>(Arrays.asList("agg_funcs", "array_funcs", "binary_funcs", "bitwise_funcs",
        "collection_funcs", "predicate_funcs", "conditional_funcs", "conversion_funcs",
        "csv_funcs", "datetime_funcs", "generator_funcs", "hash_funcs", "json_funcs",
        "lambda_funcs", "map_funcs", "math_funcs", "misc_funcs", "string_funcs", "struct_funcs",
        "window_funcs", "xml_funcs", "table_funcs", "url_funcs"));
```

Two differences:
1. table_funcs: not supported by gen-sql-functions-docs.py;
2. binary_funcs: I cannot find any function using this group.
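The comparison above can be sketched as a Python set difference (the group names are copied from the validGroups snippet; which groups the script ends up documenting is an assumption based on this comment, not the script's actual internals):

```python
# Groups accepted by FunctionRegistry's validGroups (from the Java snippet).
valid_groups = {
    "agg_funcs", "array_funcs", "binary_funcs", "bitwise_funcs",
    "collection_funcs", "predicate_funcs", "conditional_funcs", "conversion_funcs",
    "csv_funcs", "datetime_funcs", "generator_funcs", "hash_funcs", "json_funcs",
    "lambda_funcs", "map_funcs", "math_funcs", "misc_funcs", "string_funcs",
    "struct_funcs", "window_funcs", "xml_funcs", "table_funcs", "url_funcs",
}
# Assumed: after this PR the doc script covers everything except the two
# groups called out above.
documented_groups = valid_groups - {"table_funcs", "binary_funcs"}

# The two differences mentioned in the comment:
print(sorted(valid_groups - documented_groups))  # ['binary_funcs', 'table_funcs']
```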

Contributor

QQ: For generator_funcs, do we have documentation for them when used in the FROM clause of a query? Functions like explode are typically considered table-valued generator functions.

@zhengruifeng zhengruifeng changed the title [WIP][SPARK-45232][DOC] Add missing function groups to SQL references [WIP][SPARK-45232][DOCS] Add missing function groups to SQL references Sep 21, 2023
@gatorsmile
Member

cc @srielau

@zhengruifeng
Contributor Author

[screenshots of the newly generated SQL function reference pages]

@zhengruifeng
Contributor Author

We can check the documents built in the GitHub Actions run of this PR: https://github.com/zhengruifeng/spark/actions/runs/6249096629

[screenshot of the built documentation]

However, the build artifact expires after one day.

@zhengruifeng zhengruifeng changed the title [WIP][SPARK-45232][DOCS] Add missing function groups to SQL references [SPARK-45232][DOCS] Add missing function groups to SQL references Sep 21, 2023
@srielau
Contributor

srielau commented Sep 21, 2023

What is the purpose of "lambda function"? All others are type-specific or "functionality"-specific.
But lambda is "technology". What is the user journey that would drive one to browse lambda functions?

@zhengruifeng zhengruifeng changed the title [SPARK-45232][DOCS] Add missing function groups to SQL references [SPARK-45232][SQL][DOCS] Add missing function groups to SQL references Sep 21, 2023
@zhengruifeng
Contributor Author

> What is the purpose of "lambda function"? All others are type-specific or "functionality"-specific. But lambda is "technology".

Lambda functions are already exposed to end users (e.g. https://spark.apache.org/docs/latest/api/python/reference/pyspark.sql/api/pyspark.sql.functions.array_sort.html#pyspark.sql.functions.array_sort).

I think if we document the other functions here, it is better to add lambda functions as well.

> What is the user journey that would drive one to browse lambda functions?

I think this could be an example: when a user tries to sort an array of structs in a specific order, they may refer to the documentation of array_sort.
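The user journey above can be illustrated with a plain-Python sketch of array_sort's comparator semantics (standard-library code mimicking the idea, not PySpark; the employee data is made up):

```python
from functools import cmp_to_key

# Hypothetical "array of structs", modeled here as a list of dicts.
employees = [
    {"name": "b", "salary": 30},
    {"name": "a", "salary": 10},
    {"name": "c", "salary": 20},
]

# array_sort(expr, func) takes a comparator returning a negative, zero, or
# positive number; the same contract as a classic three-way compare.
def by_salary(left, right):
    return (left["salary"] > right["salary"]) - (left["salary"] < right["salary"])

sorted_employees = sorted(employees, key=cmp_to_key(by_salary))
print([e["name"] for e in sorted_employees])  # ['a', 'c', 'b']
```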

@HyukjinKwon
Member

Just to be clear, this is automatically generated documentation based on the current in-source documentation. If the grouping is wrong, or needs to be fixed, we should fix the main documentation in functions.scala.

@zhengruifeng
Contributor Author

@HyukjinKwon this page is not built from functions.scala, but from the groups specified in the expression definitions.

@HyukjinKwon
Member

yeah, I mean individual ExpressionDescription.

@srielau
Contributor

srielau commented Sep 22, 2023

> I think this could be an example: when a user tries to sort an array of structs in a specific order, they may refer to the documentation of array_sort.

If I try to find a function that sorts arrays I will try to find that function under collection functions.
Just like substr() is not a math function, even though most of its arguments are integers. Substr operates on strings...
array_sort operates on arrays.

@zhengruifeng
Contributor Author

> If I try to find a function that sorts arrays I will try to find that function under collection functions. Just like substr() is not a math function, even though most of its arguments are integers. Substr operates on strings... array_sort operates on arrays.

Got it. I think we can rename the group name in the .md files, but it needs to be different from the others. What about Advanced Collection Functions? @srielau

@srielau
Contributor

srielau commented Sep 24, 2023

> Got it. I think we can rename the group name in the .md files, but it needs to be different from the others. What about Advanced Collection Functions?

How about having our cake and eating it too? Can a function be in more than one group?

@zhengruifeng
Contributor Author

> How about having our cake and eating it too? Can a function be in more than one group?

Probably we can. I will try to map lambda to collection just in the doc build.

I think making a function belong to more than one group would be much more complex.
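A minimal sketch of that doc-build remapping, assuming a hypothetical helper (names are illustrative, not the actual gen-sql-functions-docs.py):

```python
# During doc generation, fold lambda_funcs into collection_funcs so that
# higher-order functions appear on the collection-functions page, without
# touching the group recorded in each expression definition.
GROUP_ALIASES = {"lambda_funcs": "collection_funcs"}

def effective_group(group: str) -> str:
    """Return the group a function should be documented under."""
    return GROUP_ALIASES.get(group, group)

print(effective_group("lambda_funcs"))  # collection_funcs
print(effective_group("hash_funcs"))    # hash_funcs
```

Keeping the alias table in the doc build sidesteps the more invasive option of letting a function declare multiple groups.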

@zhengruifeng
Contributor Author

[screenshots: lambda functions now shown under collection functions]

Now we put lambda functions into collection functions.

@zhengruifeng
Contributor Author

@srielau I have put lambda functions into collection functions; I think this PR is ready to merge.

@zhengruifeng
Contributor Author

Thanks @srielau @allisonwang-db @HyukjinKwon

Merged to master.

@zhengruifeng zhengruifeng deleted the doc_xml_functions branch September 26, 2023 06:13