
feat: add support crc32 expression #3498

Open
rafafrdz wants to merge 5 commits into apache:main from rafafrdz:feat/add-support-crc32

Conversation

@rafafrdz

Summary

  • Added support for Spark crc32 by mapping Crc32 to the native scalar function crc32 in expression serde.
  • Extended hash-function coverage in CometExpressionSuite to include crc32 on string and cast numeric inputs.
  • Updated spark_expressions_support.md to mark crc32 as supported.

Why
This addresses one of the missing DataFusion 50 migration functions from issue #2443 and reduces fallback to Spark for crc32 queries.
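
For example, a query along these lines (table and column names are illustrative) previously fell back to Spark for the crc32 call and can now run natively:

SELECT crc32(cast(id AS string)) FROM events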

Part of #2443 and #240

@andygrove
Member

Thanks for adding crc32 support @rafafrdz!

I think SparkCrc32 from datafusion-spark needs to be registered in the session context. If you look at register_datafusion_spark_function() in native/core/src/execution/jni_api.rs, you'll see that SparkSha1 and SparkSha2 are both explicitly imported and registered there, but SparkCrc32 isn't yet. Without that, the fallback registry.udf("crc32") lookup won't find the function. Something like this should do it:

// ScalarUDF is re-exported via datafusion's logical_expr module
use datafusion::logical_expr::ScalarUDF;
use datafusion_spark::function::hash::crc32::SparkCrc32;
// ... inside register_datafusion_spark_function():
session_ctx.register_udf(ScalarUDF::new_from_impl(SparkCrc32::default()));

For the tests, Comet now has a SQL file-based testing framework (CometSqlFileTestSuite) that's preferred for expression tests. Could you add crc32.sql there, following the instructions in https://datafusion.apache.org/comet/contributor-guide/sql-file-tests.html?
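
For reference, a minimal sketch of what crc32.sql could contain, pieced together from the fragments quoted in the review comments further down (the INSERT and the first query are taken from there; the literal-argument SELECT and the comments are illustrative, and the authoritative directive syntax is in the linked guide):

-- populate the test table (abbreviated)
INSERT INTO test VALUES ('Spark SQL ', 10, 1.2), (NULL, NULL, NULL), ('', 0, 0.0), ('苹果手机', NULL, 3.999999)

query
SELECT crc32(cast(a as string)), crc32(cast(b as string)) FROM test

-- literal arguments
query ignore(https://github.com/apache/datafusion-comet/issues/3340)
SELECT crc32('Spark'), crc32(cast(NULL as string))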

@andygrove
Member

Tests currently fail because the function isn't registered, as expected. @rafafrdz how did you test this locally?

- hash functions *** FAILED *** (386 milliseconds)
  org.apache.spark.SparkException: Job aborted due to stage failure: Task 2 in stage 3753.0 failed 1 times, most recent failure: Lost task 2.0 in stage 3753.0 (TID 12906) (localhost executor driver): org.apache.comet.CometNativeException: Error from DataFusion: There is no UDF named "crc32" in the registry. Use session context `register_udf` function to register a custom UDF.

@rafafrdz force-pushed the feat/add-support-crc32 branch from 52d95d7 to 29d9250 on February 13, 2026 at 09:35
@rafafrdz
Author

@andygrove Sorry, I missed pushing the commit where I register the UDF... 😅
Also, I built and tested locally against the version matrix. For that, I'm using mise to handle the different versions of Java and Scala, and I'm running the following command:

./mvnw test -Dsuites="org.apache.comet.CometSqlFileTestSuite <test>" -Pspark-<spark_version> -Pscala-<scala_version>

for example:

./mvnw test -Dsuites="org.apache.comet.CometSqlFileTestSuite crc32" -Pspark-4.0 -Pscala-2.13
./mvnw test -Dsuites="org.apache.comet.CometSqlFileTestSuite crc32" -Pspark-3.5 -Pscala-2.13
./mvnw test -Dsuites="org.apache.comet.CometSqlFileTestSuite crc32" -Pspark-3.5 -Pscala-2.12

and switching the Java version with mise use java@corretto-17 or mise use java@corretto-11.

Please let me know if I'm missing something or doing something wrong, and apologies for the inconvenience.

query
SELECT crc32(cast(a as string)), crc32(cast(b as string)) FROM test

-- literal arguments query ignore(https://github.com/apache/datafusion-comet/issues/3340)
Member

It looks like a newline is missing here?

Suggested change
-- literal arguments query ignore(https://github.com/apache/datafusion-comet/issues/3340)
-- literal arguments
query ignore(https://github.com/apache/datafusion-comet/issues/3340)

INSERT INTO test VALUES ('Spark SQL ', 10, 1.2), (NULL, NULL, NULL), ('', 0, 0.0), ('苹果手机', NULL, 3.999999), ('Spark SQL ', 10, 1.2), (NULL, NULL, NULL), ('', 0, 0.0), ('苹果手机', NULL, 3.999999)

query
SELECT crc32(cast(a as string)), crc32(cast(b as string)) FROM test
Member

Should there also be a test for crc32(col)?
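
A direct-column variant in the same file format might look like this (a sketch; Spark's crc32 implicitly casts string input to binary, so it should accept the string column directly):

query
SELECT crc32(a) FROM test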
