Skip to content

feat: Enable native c2r by default, add debug asserts#3649

Merged
andygrove merged 7 commits intoapache:mainfrom
andygrove:enable-native-c2r
Mar 10, 2026
Merged

feat: Enable native c2r by default, add debug asserts#3649
andygrove merged 7 commits intoapache:mainfrom
andygrove:enable-native-c2r

Conversation

@andygrove
Copy link
Member

@andygrove andygrove commented Mar 8, 2026

Which issue does this PR close?

Closes #.

Rationale for this change

~I see a 10% speedup in TPC-H @ 1TB with this feature enabled. edit: this was actually due to using native_datafusion scan. However, I see no perf regression from this PR, and it reduces GC pressure by avoiding many java allocations.

What changes are included in this PR?

  • Change config default
  • Add debug asserts before unsafe code
  • Update golden files

How are these changes tested?

@andygrove andygrove marked this pull request as ready for review March 8, 2026 23:18
) -> jni::sys::jobject {
try_unwrap_or_throw(&e, |mut env| {
// Get the context
debug_assert!(c2r_handle != 0, "columnarToRowConvert: c2r_handle is null");
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

can c2r_handle be negative?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

jlong is a signed 64-bit integer, so yes, c2r_handle can technically be negative since memory addresses above 2^63 appear as negative values when stored in a signed long. However, the only invalid value we need to check for is 0 (null pointer). Negative values are valid memory addresses on 64-bit systems.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sorry for the AI response. Claude didn't ask me first 😞

c2r_handle: jlong,
) {
try_unwrap_or_throw(&e, |_env| {
debug_assert!(c2r_handle != 0, "columnarToRowClose: c2r_handle is null");
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

same

let src_start = start_idx * $elem_size;
debug_assert!(
src_start + byte_len <= values_slice.len() * $elem_size,
"bulk_copy_range: source slice out of bounds: src_start={}, byte_len={}, values_len={}",
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

should we output elem.size as well?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good idea, done.

inherits = "release"
lto = false # Skip LTO for faster linking
codegen-units = 16 # Parallel codegen (faster compile, slightly larger binary)
debug-assertions = true
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

should it be always true, even for releases?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

we wouldn't want to enable the assertions for releases because it would defeat the point of having the unsafe code that skips bounds checks

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Oh I see, I thought it is enabled for release and it was my concern, for ci it is fine.

@andygrove andygrove marked this pull request as draft March 9, 2026 15:48
@andygrove andygrove marked this pull request as draft March 9, 2026 15:48
@andygrove
Copy link
Member Author

moving to draft while I run new benchmarks to confirm the benefit

@andygrove andygrove marked this pull request as ready for review March 9, 2026 16:31
@andygrove
Copy link
Member Author

It looks like a debug assertion did fail, but no detailed were logged. I am enabling RUST_BACKTRACE next.

Enable RUST_BACKTRACE=1 in all CI workflows to get full stack traces
when debug assertions or other panics occur. Also add elem_size to
the bulk_copy_range debug assert message for better diagnostics.
@andygrove
Copy link
Member Author

I am still not seeing a stack trace or reason for the failure.

2026-03-09T22:34:13.1918758Z [info] - SPARK-32069: Improve error message on reading unexpected directory (316 milliseconds)
2026-03-09T22:34:13.2165413Z 22:34:13.215 WARN org.apache.hadoop.hive.common.FileUtils: File file:/__w/datafusion-comet/datafusion-comet/apache-spark/sql/hive/target/tmp/hive_execution_test_group/warehouse-afa5d96b-8cb8-47a1-b7c5-386c60c294b3/src does not exist; Force to delete it.
2026-03-09T22:34:13.2166765Z 
2026-03-09T22:34:13.2167568Z 22:34:13.216 ERROR org.apache.hadoop.hive.common.FileUtils: Failed to delete file:/__w/datafusion-comet/datafusion-comet/apache-spark/sql/hive/target/tmp/hive_execution_test_group/warehouse-afa5d96b-8cb8-47a1-b7c5-386c60c294b3/src
2026-03-09T22:34:13.2169091Z 
2026-03-09T22:34:13.3359262Z [info] HashAggregationQueryWithControlledFallbackSuite:
2026-03-09T22:34:14.1173565Z thread caused non-unwinding panic. aborting.
2026-03-09T22:34:49.4551378Z Warning: Unable to read from client, please check on client for further details of the problem.

@andygrove
Copy link
Member Author

Added panic = "unwind" to help with this. It looks like the same assertion is failing in main branch.

Copy link
Contributor

@parthchandra parthchandra left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

lgtm

@andygrove
Copy link
Member Author

test failure is an existing issue in main branch and is fixed in #3652.

I will rebase this PR once that is merged

Copy link
Contributor

@comphead comphead left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks @andygrove the CI fails presumably because of OOM issues

@comphead
Copy link
Contributor

test failure is an existing issue in main branch and is fixed in #3652.

I will rebase this PR once that is merged

The tests passed on CI, it says sbt.ForkMain 40713 failed with exit code 134

@andygrove
Copy link
Member Author

test failure is an existing issue in main branch and is fixed in #3652.
I will rebase this PR once that is merged

The tests passed on CI, it says sbt.ForkMain 40713 failed with exit code 134

A debug assert is triggered because of unaligned memory access.

@andygrove
Copy link
Member Author

test failure is an existing issue in main branch and is fixed in #3652.
I will rebase this PR once that is merged

The tests passed on CI, it says sbt.ForkMain 40713 failed with exit code 134

A debug assert is triggered because of unaligned memory access.

Actually, I am a bit confused now. The error is:

Comet native panic: panicked at core/src/execution/shuffle/spark_unsafe/row.rs:293:29:
misaligned pointer dereference: address must be a multiple of 0x8 but is 0xc1a80000c

I confirmed that the changes in #3652. do fix this, but not sure why this only shows up when enabling debug assertions 😕

@andygrove
Copy link
Member Author

test failure is an existing issue in main branch and is fixed in #3652.
I will rebase this PR once that is merged

The tests passed on CI, it says sbt.ForkMain 40713 failed with exit code 134

A debug assert is triggered because of unaligned memory access.

Actually, I am a bit confused now. The error is:

Comet native panic: panicked at core/src/execution/shuffle/spark_unsafe/row.rs:293:29:
misaligned pointer dereference: address must be a multiple of 0x8 but is 0xc1a80000c

I confirmed that the changes in #3652. do fix this, but not sure why this only shows up when enabling debug assertions 😕

I suppose the debug assertion must be in Rust

@parthchandra
Copy link
Contributor

The expression causing panic may be in the debug_assert. It would then show up only in debug. The issue would be missed silently!

@andygrove andygrove merged commit 98b4484 into apache:main Mar 10, 2026
139 of 159 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants