
Minor: Generate the supported Spark builtin expression list into MD file #455

Merged: 21 commits merged into apache:main on Jun 5, 2024

Conversation

@comphead (Contributor) commented May 21, 2024

Which issue does this PR close?

Follow up on #331
Closes #282
Related #240

Rationale for this change

Generate the supported Spark builtin expression list into an MD file instead of a txt file. The txt file still exists, as it provides the reason why an expression is not supported.

What changes are included in this PR?

How are these changes tested?

@andygrove (Member):
This is very cool @comphead but it looks like it is not detecting any of the aggregate functions that we support?

Review comment on docs/spark_expressions_support.md (outdated):
- [ ] ifnull
- [ ] nanvl
- [x] nullif
- [ ] nvl
Contributor:

Hmm, it should be supported? It's essentially the same as coalesce, and it is replaced during the analysis phase.

Maybe we should file an issue to track this kind of problem.
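For illustration, a minimal check (assuming a running SparkSession named spark; the exact plan text is approximate):

// Sketch only: nvl is rewritten into coalesce before physical planning,
// so only coalesce needs a native Comet implementation.
val df = spark.sql("SELECT nvl(a, 0) FROM VALUES (1), (NULL) AS t(a)")
println(df.queryExecution.optimizedPlan)
// e.g. Project [coalesce(a#0, 0) AS nvl(a, 0)#2]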


### hash_funcs
- [ ] crc32
- [ ] hash
Contributor:

hash should be supported.

Ah, the example hash function passes an array, which is not a supported type in Comet yet.

Maybe supporting nested types could be prioritized after we have full TPC-H and TPC-DS support.
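For illustration (plain Spark SQL, assuming a SparkSession named spark): hash over scalar inputs is the case that could run natively, while an array argument brings in the unsupported nested type.

// Illustrative only: scalar vs. array inputs to hash.
spark.sql("SELECT hash(1, 'abc')").show()       // scalar args: native candidate
spark.sql("SELECT hash(array(1, 2, 3))").show() // array arg: nested type, falls back to Spark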

@comphead (Author):

@andygrove @advancedxy I fixed the test by implementing extra parsing, plus small manual tests where the parsing is complicated. I hope we now have a better picture.

@advancedxy (Contributor) left a review comment:

Left some minor comments, otherwise LGTM.

@comphead (Author):

Thanks @advancedxy, I fixed the flaws you mentioned. However, I'd like to do the refactoring you recommended in a follow-up PR; this PR is getting too large for review.

@advancedxy (Contributor):

> Thanks @advancedxy, I fixed the flaws you mentioned. However, I'd like to do the refactoring you recommended in a follow-up PR; this PR is getting too large for review.

Of course, sounds good to me.

@advancedxy (Contributor) left a review comment:

One minor comment; otherwise LGTM.

@advancedxy (Contributor) left a review comment:

lgtm

@comphead requested review from andygrove and viirya on May 28, 2024
Comment on lines +60 to +62
- [x] regr_avgx
- [x] regr_avgy
- [x] regr_count
Member:

I don't think that we support these expressions

@comphead (Author):

The test is exactly for YEAR, but if only YEAR is supported, what are we supposed to show to the user? Not supported?

@comphead (Author):

test("regr_avgx") {
    Seq(false, true).foreach { dictionary =>
      withSQLConf(
        "parquet.enable.dictionary" -> dictionary.toString,
        "spark.comet.exec.shuffle.enabled" -> "true",
        CometConf.COMET_ENABLED.key -> "true",
        CometConf.COMET_EXEC_ENABLED.key -> "true",
        CometConf.COMET_SHUFFLE_ENFORCE_MODE_ENABLED.key -> "true",
        CometConf.COMET_EXEC_ALL_OPERATOR_ENABLED.key -> "true",
      ) {
        val table = "test"
        withTable(table) {
          sql(s"create table $table(a int, b int) using parquet")
          sql(s"insert into $table VALUES (1, 2), (2, 2), (2, 3), (2, 4)")
          checkSparkAnswerAndOperator(s"SELECT regr_avgx(a, b) FROM $table")
        }
      }
    }
  }

The regr_avgx test passes.

@comphead (Author):

regr_avgx is supported by DataFusion:

> SELECT regr_avgx(1, 2);
+------------------------------+
| REGR_AVGX(Int64(1),Int64(2)) |
+------------------------------+
| 2.0                          |
+------------------------------+

So I think all is fine here.

Comment on lines 332 to 335
- [x] try_add
- [x] try_divide
- [x] try_multiply
- [x] try_subtract
Member:

I do not see any tests for these functions, and planner.rs seems to ignore the fail_on_error flag in the protobuf. If your tool says we support them, then we likely do support them, but not correctly. I will file an issue to look into this.

Member:

Actually, we already have an issue: #280

@comphead (Author):

  test("try_add") {
    Seq(false, true).foreach { dictionary =>
      withSQLConf(
        "parquet.enable.dictionary" -> dictionary.toString,
        "spark.comet.exec.shuffle.enabled" -> "true") {
        val table = "test"
        withTable(table) {
          sql(s"create table $table(a int, b int) using parquet")
          sql(s"insert into $table VALUES (1, 2)")
          checkSparkAnswerAndOperator(s"SELECT try_add(a, b) FROM $table")
        }
      }
    }
  }

This test passes.

@comphead (Author):

The plan is Comet-native:

== Physical Plan ==
*(1) ColumnarToRow
+- CometProject [try_add(a, b)#57], [(a#40 + b#41) AS try_add(a, b)#57]
   +- CometScan parquet spark_catalog.default.test[a#40,b#41] Batched: true, DataFilters: [], Format: CometParquet, Location: InMemoryFileIndex(1 paths)[file:/Users/ovoievodin/dev/prj/apple/ovoievodin/rust/arrow-datafusion-..., PartitionFilters: [], PushedFilters: [], ReadSchema: struct<a:int,b:int>

The answer is the same as Spark's, which is 3.

@comphead (Author):

So we support try_* functions; no fallback happens because we handle

case add @ Add(left, right, _)

where the last argument is responsible for TRY mode. So there is no fallback, but the calculation is not correct when overflow happens. I'll move the try tests to manual.
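For illustration, a minimal sketch (not the actual Comet serde code; EvalMode is the evaluation-mode enum in Spark 3.4+, and isNativelySupported is a made-up helper) of how matching on that third argument could force a fallback for TRY mode:

import org.apache.spark.sql.catalyst.expressions.{Add, EvalMode, Expression}

// Hypothetical helper: treat TRY-mode adds as unsupported so the planner
// falls back to Spark instead of silently dropping the TRY semantics.
def isNativelySupported(expr: Expression): Boolean = expr match {
  case Add(_, _, EvalMode.TRY) => false // TRY overflow handling not implemented
  case Add(_, _, _) => true             // LEGACY/ANSI add is supported
  case _ => false
}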

Comment on lines +343 to +346
- [x] current_catalog
- [x] current_database
- [x] current_schema
- [x] current_user
Member:

We don't have a native implementation for these methods. I am guessing that the Spark planner replaces these with literal values, and we then see a CometProject for those literal values.

Member:

That's correct. These expressions are replaced with literals after the analysis stage.

@comphead (Author):

Yes, those are handled purely by Spark execution. Are we planning to support them natively in Comet?

@andygrove (Member) commented Jun 4, 2024:

We cannot support these natively in Comet because these functions will never appear in the physical plan. Spark replaces them with literal values before this.
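As an illustrative check (assuming a SparkSession named spark; the plan text is approximate):

// Sketch only: by the time the plan is optimized, current_database() is
// already a string literal, so it never reaches Comet's physical planning.
val df = spark.sql("SELECT current_database()")
println(df.queryExecution.optimizedPlan)
// e.g. Project [default AS current_database()#0]
//      +- OneRowRelation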

@comphead (Author):

Sounds good, I'll remove them from the list.

- [ ] date_diff
- [ ] date_format
- [ ] date_from_unix_date
- [x] date_part
Member:

We only support date_part for YEAR

- [ ] dayofmonth
- [ ] dayofweek
- [ ] dayofyear
- [x] extract
Member:

We only support extract for YEAR
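For illustration (plain Spark SQL, assuming a SparkSession named spark; per these comments only the YEAR field stays native):

// Illustrative only: both forms extract the YEAR field, the one case that is
// supported natively; other fields (MONTH, DAY, ...) fall back to Spark.
spark.sql("SELECT date_part('YEAR', make_date(2024, 6, 5))").show()
spark.sql("SELECT extract(YEAR FROM make_date(2024, 6, 5))").show()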

@comphead (Author):

@andygrove I fixed all the comments. However, you are right that sometimes we only partially support a function, meaning part of its syntax or some value range is not supported.

Here is an idea for a follow-up PR: introduce a "partially supported" status (or similar), along with the reason why the expression is only partially supported; see the sketch below.
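A hypothetical sketch of what such a status could look like in the generator (the names are made up for illustration):

// Hypothetical status type for the coverage generator; not existing code.
sealed trait SupportLevel
case object Supported extends SupportLevel
case object Unsupported extends SupportLevel
case class Partial(reason: String) extends SupportLevel // e.g. "only the YEAR field"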

@comphead requested a review from andygrove on June 3, 2024
@@ -29,7 +29,7 @@ Comet aims to support:
- a native Parquet implementation, including both reader and writer
- full implementation of Spark operators, including
Filter/Project/Aggregation/Join/Exchange etc.
- full implementation of Spark built-in expressions
- [full implementation](../../../docs/spark_expressions_support.md) of Spark built-in expressions.
@andygrove (Member):

This won't build correctly:

/Users/andy/git/apache/datafusion-comet/docs/temp/user-guide/overview.md:32: WARNING: Unknown source document '../spark_expressions_support' [myst.xref_missing]

Let's revert this change for this PR and handle where we publish (user guide vs contributor guide) in a follow-up PR.

Co-authored-by: Andy Grove <andygrove73@gmail.com>
@andygrove (Member) left a review comment:

Thanks @comphead. LGTM.

The content generated by this tool is helpful, and I learned more about what we support from reviewing this PR.

My opinion is that the generated docs make sense for the contributor guide, to help people see where to contribute, but that we should keep a human-curated version for the user guide, where we can add more context about supported expressions. We can discuss this in a future PR.

@codecov-commenter:

Codecov Report

All modified and coverable lines are covered by tests ✅

Project coverage is 34.23%. Comparing base (9ca63a2) to head (e8f3b77).
Report is 27 commits behind head on main.

Additional details and impacted files
@@             Coverage Diff              @@
##               main     #455      +/-   ##
============================================
+ Coverage     34.18%   34.23%   +0.04%     
+ Complexity      851      806      -45     
============================================
  Files           116      105      -11     
  Lines         38570    38488      -82     
  Branches       8531     8562      +31     
============================================
- Hits          13187    13175      -12     
+ Misses        22612    22554      -58     
+ Partials       2771     2759      -12     


@comphead merged commit b3ba82f into apache:main on Jun 5, 2024. 43 checks passed.
Successfully merging this pull request may close these issues.

feat: Make spark builtin expression coverage report more readable
5 participants