[SPARK-48359][SQL] Built-in functions for Zstd compression and decompression #46672
Instead of adding (de)compression functions for different codecs, how about adding the |
Hi @yaooqinn, yes, that can be one way of implementing them. However, based on the following,
Thus, the functions are named |
A parameter with a default value can achieve this. The default value can be either hard coded or configurable by session conf. If
Most of the existing SQL functions are derived from other systems, Apache Hive, Postgres, MySQL, etc. AFAIK, Spark currently does not have such a naming convention itself, while 'supported by many other modern platforms' or 'defined in ANSI' are the rules we used mostly for adding new SQL functions |
What changes were proposed in this pull request?
Some users are using UDFs for Zstd compression and decompression, which results in poor performance. If we provide native functions, the performance will be improved by compressing and decompressing just within the JVM.
Now, we are introducing three new built-in functions:
where
- input: The binary value to compress or decompress.
- level: Optional integer argument that represents the compression level. The compression level controls the trade-off between compression speed and compression ratio. The default level is 3. Valid values: between 1 and 22 inclusive.
- streaming_mode: Optional boolean argument that represents whether to use streaming mode to compress.

Examples:
These three built-in functions are also available in Python and Scala.
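The argument contract described above (default level 3, valid levels 1 to 22, and a `try_` variant alongside the plain decompress function) can be sketched in plain Python. This is only an illustration, not the PR's implementation: `zlib` from the standard library stands in for Zstd, the helper names are hypothetical, `streaming_mode` is not modeled, and the sketch assumes that `try_zstd_decompress` follows the usual Spark `try_` convention of returning NULL instead of raising on bad input.

```python
import zlib
from typing import Optional

DEFAULT_LEVEL = 3              # default level stated in the PR description
MIN_LEVEL, MAX_LEVEL = 1, 22   # valid range stated in the PR description


def check_level(level: int = DEFAULT_LEVEL) -> int:
    """Validate a compression level against the documented 1..22 range."""
    if not MIN_LEVEL <= level <= MAX_LEVEL:
        raise ValueError(
            f"level must be between {MIN_LEVEL} and {MAX_LEVEL}, got {level}"
        )
    return level


def compress(data: bytes, level: int = DEFAULT_LEVEL) -> bytes:
    """Hypothetical stand-in for zstd_compress; zlib replaces Zstd here."""
    level = check_level(level)
    # zlib only accepts levels 0..9, so clamp the validated level.
    # This clamp is a detail of the stand-in codec, not of Zstd.
    return zlib.compress(data, min(level, 9))


def decompress(data: bytes) -> bytes:
    """Hypothetical stand-in for zstd_decompress; raises on malformed input."""
    return zlib.decompress(data)


def try_decompress(data: bytes) -> Optional[bytes]:
    """Hypothetical stand-in for try_zstd_decompress.

    Assumption: malformed input yields None (NULL) instead of an error.
    """
    try:
        return decompress(data)
    except zlib.error:
        return None
```

For example, `decompress(compress(b"spark" * 100))` round-trips the payload, while `try_decompress` on bytes that were never compressed yields `None` under the assumption stated above.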
Why are the changes needed?
Users no longer need to use UDFs for Zstd compression and decompression; they can directly use built-in SQL functions to run within the JVM.
Does this PR introduce any user-facing change?
Yes, three SQL functions, zstd_compress, zstd_decompress, and try_zstd_decompress, are introduced.
How was this patch tested?
Added new UT and E2E tests.
Was this patch authored or co-authored using generative AI tooling?
No.