
[EPIC] Add Spark expression coverage #240

Open
viirya opened this issue Mar 31, 2024 · 11 comments
Labels
enhancement New feature or request help wanted Extra attention is needed

Comments

@viirya (Member) commented Mar 31, 2024

What is the problem the feature request solves?

This is an umbrella ticket for the list of unsupported Spark expressions. It is not necessarily a comprehensive list of all Spark expressions, because there are too many; we can start with the most frequently used ones.

Hash expressions #205

  • Md5: supported by DataFusion, but Comet disables it for now. DataFusion's crypto_expressions feature includes blake3, which cannot be built on the Mac platform, and we cannot enable only md5 in DataFusion separately.
  • Murmur3Hash
  • XxHash64
  • HiveHash
  • Sha1
  • Sha2
  • Crc32

Datetime expressions

Conditional expressions

  • If
  • CaseWhen

Arithmetic expressions

  • UnaryMinus
  • UnaryPositive
  • Abs
  • Add
  • Subtract
  • Multiply
  • Divide
  • IntegralDivide
  • Remainder
  • Pmod
  • Least
  • Greatest

Bitwise expressions

  • BitwiseAnd
  • BitwiseOr
  • BitwiseXor
  • BitwiseNot
  • BitwiseCount
  • BitwiseGet

String expressions

  • Ascii
  • BitLength
  • StringInstr
  • StringRepeat
  • StringReplace
  • StringTranslate
  • StringTrim
  • StringTrimLeft
  • StringTrimRight
  • StringTrimBoth
  • Upper
  • Lower
  • Length
  • InitCap
  • Chr
  • ConcatWs

Math expressions

  • Acos
  • Asin
  • Atan
  • Atan2
  • Ceil
  • Cos
  • Exp
  • Floor
  • Log
  • Log10
  • Log2
  • Pow
  • Round (3.3+)
  • Signum
  • Sin
  • Sqrt
  • Tan
  • Asinh
  • Sinh
  • Csc
  • Rint
  • Log1p
  • Factorial
  • RoundFloor
  • Expm1
  • Conv
  • Acosh
  • Cosh
  • Sec
  • RoundCeil
  • Cbrt
  • Pi
  • EulerNumber
  • Cot
  • Tanh
  • Atanh
  • ToDegrees
  • ToRadians
  • Bin
  • Hex
  • Unhex
  • ShiftLeft
  • ShiftRight
  • ShiftRightUnsigned
  • Hypot
  • Logarithm
  • BRound
  • WidthBucket

Predicates

  • In
  • InSet
  • Not
  • InSubquery
  • And
  • Or
  • EqualTo
  • EqualNullSafe
  • EqualNull
  • LessThan
  • LessThanOrEqual
  • GreaterThan
  • GreaterThanOrEqual

Null expressions

  • Coalesce

Aggregate expressions

Others

  • Cast

...

Describe the potential solution

No response

Additional context

No response

@viirya viirya added the enhancement New feature or request label Mar 31, 2024
@comphead (Contributor) commented Apr 4, 2024

The list of Spark expressions can be found at https://spark.apache.org/docs/latest/api/sql/index.html

@advancedxy (Contributor) commented

As an umbrella issue, if you are going to make a list of frequently used expressions, maybe you can add the hash expressions (for which I created #205 earlier) as one of the categories.

@viirya (Member, Author) commented Apr 7, 2024

I added the hash expressions. Feel free to edit the expression list in the issue description to add more.

@comphead (Contributor) commented Apr 7, 2024

I was thinking of adding Spark OneRowRelation scan support in addition to Parquet scan. This would allow Comet to be enabled when running queries like

select sqrt(2) from (select 1 union all select 2)

Once it's done, we can just download all the queries from https://spark.apache.org/docs/latest/api/sql/index.html, run them automatically, and see the coverage. How does that sound?
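The coverage check sketched above could look roughly like the following in Scala. This is a hypothetical sketch, not Comet's actual test code: it assumes a local SparkSession with the Comet extension on the classpath, and it assumes Comet operators are identifiable by a "Comet" substring in the executed physical plan string.

```scala
// Hypothetical sketch of an expression-coverage check. Assumptions: the Comet
// jar is on the classpath, and Comet operators show up as "Comet*" nodes in
// the executed physical plan's string form.
import org.apache.spark.sql.SparkSession

object CometCoverageCheck {
  def isCometTriggered(spark: SparkSession, query: String): Boolean = {
    val df = spark.sql(query)
    df.collect() // force execution so the final physical plan is materialized
    df.queryExecution.executedPlan.toString.contains("Comet")
  }

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .master("local[1]")
      .config("spark.sql.extensions", "org.apache.comet.CometSparkSessionExtensions")
      .getOrCreate()
    val covered = isCometTriggered(spark, "select sqrt(2) from (select 1 union all select 2)")
    println(s"Comet triggered: $covered")
    spark.stop()
  }
}
```

Run over a list of candidate queries, the boolean results could then be aggregated into the coverage report discussed below in the thread.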

@advancedxy (Contributor) commented Apr 8, 2024

I was thinking of adding Spark OneRowRelation scan support in addition to Parquet scan.

I am adding RowToColumnar support in #206. Once it's done, I think it's trivial to add support for RDDScanExec (the physical plan that OneRowRelation is translated to).

we can just download all the queries from https://spark.apache.org/docs/latest/api/sql/index.html, run them automatically, and see the coverage. How does that sound?

That sounds like a great idea.

@comphead (Contributor) commented Apr 9, 2024

Another potential solution is to transform OneRowRelation to use DataFusion's PlaceholderRowExec. I'll check whether that's doable.

@advancedxy (Contributor) commented

Another potential solution we can do is to transform OneRowRelation and to use DF PlaceholderRowExec

Of course, that would be more performant and straightforward.

@viirya (Member, Author) commented Apr 10, 2024

It sounds good if we can automatically test expression coverage, although I'm not sure if it is easy to do.

@comphead (Contributor) commented Apr 10, 2024

I have an idea for how to do that and plan to create a draft soon.
The basic idea is to grab the queries from https://spark.apache.org/docs/latest/api/sql/index.html and create a separate task in Comet which checks whether Comet is being triggered.

It will be easier if Comet supports OneRowRelation, but even without it there is a workaround. Once all the built-in function queries are done, there should be an HTML report with the total results.

@advancedxy (Contributor) commented

Basic idea is to grab queries from https://spark.apache.org/docs/latest/api/sql/index.html and create a separate task in Comet which will check if the Comet being triggered.

Hmm, I think you can get an expression's example usage directly from its annotated class.
See org.apache.spark.sql.expressions.ExpressionInfoSuite for how to get the examples directly.

@comphead (Contributor) commented

Basic idea is to grab queries from https://spark.apache.org/docs/latest/api/sql/index.html and create a separate task in Comet which will check if the Comet being triggered.

Hmmm, I think you can get expression example usage directly from its annotated class. See 'org.apache.spark.sql.expressions.ExpressionInfoSuite' for how to get examples directly.

Great idea. I think I'm able to fetch them as done here: https://github.com/apache/spark/blob/6fdf9c9df545ed50acbce1ec874625baf03d4d2e/sql/core/src/test/scala/org/apache/spark/sql/expressions/ExpressionInfoSuite.scala#L166
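A minimal sketch of that approach: list the registered built-in functions, look up each one's ExpressionInfo, and pull the `> SELECT ...;` lines out of the annotation's examples text, similar to what ExpressionInfoSuite does. The regex and the exact shape of the examples text are assumptions here; multi-line example queries would need more careful parsing.

```scala
// Sketch: harvest example queries from the ExpressionInfo annotations of all
// registered built-in functions. Assumes example queries appear as single
// "  > SELECT ...;" lines in the getExamples text, which holds for most but
// not all built-in functions.
import org.apache.spark.sql.SparkSession

object ExampleQueryHarvester {
  def exampleQueries(spark: SparkSession): Seq[String] = {
    val pattern = "(?m)^ *> (SELECT .*);".r
    spark.sessionState.functionRegistry.listFunction().flatMap { ident =>
      val info = spark.sessionState.catalog.lookupFunctionInfo(ident)
      // getExamples returns the raw "Examples:" block of the annotation
      pattern.findAllMatchIn(info.getExamples).map(_.group(1) + ";").toSeq
    }
  }

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().master("local[1]").getOrCreate()
    exampleQueries(spark).take(10).foreach(println)
    spark.stop()
  }
}
```

Each harvested query could then be fed to the Comet-triggered check discussed earlier in the thread to produce a per-function coverage report.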

3 participants