chore: Experiments with native_datafusion scan optimizations #3755

Draft
comphead wants to merge 6 commits into apache:main from comphead:tests

@comphead comphead commented Mar 21, 2026

Which issue does this PR close?

Part of #3748

Rationale for this change

This PR addresses some low-hanging fruit in native_datafusion scans:

  • Cache Parquet footer metadata across partitions to avoid reading and parsing it repeatedly; this matters most for files with very large schemas.
  • Optimize the O(n*m) case-conversion (to_lowercase) calls made while matching schema fields. The allocation profile currently shows this call stack:
  22.57 MB       0.2%	1210640	 	alloc::str::_$LT$impl$u20$str$GT$::to_lowercase::h97626021e3e4d091
  22.57 MB       0.2%	1210560	 	 _$LT$comet..parquet..schema_adapter..SparkPhysicalExprAdapterFactory$u20$as$u20$datafusion_physical_expr_adapter..schema_rewriter..PhysicalExprAdapterFactory$GT$::create::hd8340ae81808f4b1
  22.57 MB       0.2%	1210560	 	  _$LT$datafusion_datasource_parquet..opener..ParquetOpener$u20$as$u20$datafusion_datasource..file_stream..FileOpener$GT$::open::_$u7b$$u7b$closure$u7d$$u7d$::h3da9f8bfc88e2ec1
  22.57 MB       0.2%	1210560	 	   _$LT$datafusion_datasource..file_stream..FileStream$u20$as$u20$futures_core..stream..Stream$GT$::poll_next::hdbfbe1789f8ca04e
  22.57 MB       0.2%	1210560	 	    _$LT$datafusion_physical_plan..coop..CooperativeStream$LT$T$GT$$u20$as$u20$futures_core..stream..Stream$GT$::poll_next::haf0d3c646b21dc34
  22.57 MB       0.2%	1210560	 	     _$LT$datafusion_physical_plan..stream..BatchSplitStream$u20$as$u20$futures_core..stream..Stream$GT$::poll_next::h776c49f166956374
  22.57 MB       0.2%	1210560	 	      comet::execution::jni_api::Java_org_apache_comet_Native_executePlan::_$u7b$$u7b$closure$u7d$$u7d$::_$u7b$$u7b$closure$u7d$$u7d$::_$u7b$$u7b$closure$u7d$$u7d$::hedc575c357701add
  22.57 MB       0.2%	1210560	 	       tokio::runtime::task::raw::poll::h9f995a0e9bae688f
  22.57 MB       0.2%	1210560	 	        tokio::runtime::scheduler::multi_thread::worker::Context::run_task::h14fd0b61f20c7a54
  22.57 MB       0.2%	1210560	 	         tokio::runtime::scheduler::multi_thread::worker::run::hbc8a3dbb6ce91c58
  22.57 MB       0.2%	1210560	 	          tokio::runtime::task::raw::poll::h578114713c014b13
  22.57 MB       0.2%	1210560	 	           std::sys::backtrace::__rust_begin_short_backtrace::h721a2d1d9a0ad1e9
  22.57 MB       0.2%	1210560	 	            core::ops::function::FnOnce::call_once$u7b$$u7b$vtable.shim$u7d$$u7d$::hcb79850325811dbc

This path allocates over 1M times while doing case conversion in the schema adapter, for a scan of 20K rows with a deeply nested schema.
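
As a rough illustration of the first item (caching footer metadata across partitions), here is a minimal sketch using only the standard library. `FooterMetadata`, `FooterCache`, and `get_or_load` are hypothetical names invented for this example; they are not Comet or DataFusion APIs, and the PR's actual approach may differ. The point is only that every partition reading the same file shares one parsed footer instead of parsing it again.

```rust
use std::collections::HashMap;
use std::sync::{Arc, Mutex};

/// Hypothetical stand-in for a parsed Parquet footer.
#[derive(Debug)]
pub struct FooterMetadata {
    pub num_columns: usize,
}

/// Hypothetical cache shared by all partitions of a scan, keyed by file path.
/// A real cache would probably also key on file size or modification time
/// and bound its memory use; both are left out of this sketch.
#[derive(Default)]
pub struct FooterCache {
    entries: Mutex<HashMap<String, Arc<FooterMetadata>>>,
}

impl FooterCache {
    /// Returns the cached footer for `path`, calling `load` only on a miss.
    /// The lock is held while loading, which keeps the sketch simple but
    /// serializes concurrent loads of different files.
    pub fn get_or_load<F>(&self, path: &str, load: F) -> Arc<FooterMetadata>
    where
        F: FnOnce() -> FooterMetadata,
    {
        let mut entries = self.entries.lock().unwrap();
        if let Some(meta) = entries.get(path) {
            return Arc::clone(meta);
        }
        let meta = Arc::new(load());
        entries.insert(path.to_string(), Arc::clone(&meta));
        meta
    }
}
```

With this shape, the second partition that opens the same file gets back the same `Arc` and never runs its `load` closure.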

What changes are included in this PR?

How are these changes tested?

Comment on lines +119 to +120
// Pre-compute lowercased physical field names to avoid repeated
// to_lowercase() calls in the O(n*m) matching loop.
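
A minimal sketch of what that pre-computation could look like, assuming field names are available as plain string slices. `match_fields_precomputed` is a hypothetical helper written for illustration, not the code under review. The comparison loop is still O(n*m), but the number of `to_lowercase()` allocations drops from n*m to n+m.

```rust
/// For each logical field name, returns the index of the first physical
/// field whose name matches case-insensitively, or None if there is none.
/// Hypothetical helper for illustration only.
pub fn match_fields_precomputed(logical: &[&str], physical: &[&str]) -> Vec<Option<usize>> {
    // Lowercase every physical name once, up front: m allocations.
    let physical_lower: Vec<String> = physical.iter().map(|p| p.to_lowercase()).collect();

    logical
        .iter()
        .map(|l| {
            // One allocation per logical field: n allocations in total.
            let l_lower = l.to_lowercase();
            // The inner scan only compares strings; it does not allocate.
            physical_lower.iter().position(|p| *p == l_lower)
        })
        .collect()
}
```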
Member

Is there any chance we could use eq_ignore_ascii_case to avoid allocating the lower case strings?

Contributor Author

That's a good point.
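
The suggestion in this thread can be sketched as follows. `str::eq_ignore_ascii_case` is a standard library method that compares without allocating; `match_fields_no_alloc` is a hypothetical helper written for illustration, not code from the PR.

```rust
/// For each logical field name, returns the index of the first physical
/// field whose name matches, ignoring ASCII case only.
/// Hypothetical helper for illustration only; performs no string allocations.
pub fn match_fields_no_alloc(logical: &[&str], physical: &[&str]) -> Vec<Option<usize>> {
    logical
        .iter()
        .map(|l| physical.iter().position(|p| p.eq_ignore_ascii_case(l)))
        .collect()
}
```

One caveat: `eq_ignore_ascii_case` folds only the ASCII letters A-Z, while `to_lowercase()` applies Unicode lowercasing. Two names that differ only in the case of a non-ASCII letter would match under `to_lowercase()` but not here, so whether the swap is safe depends on the matching semantics the schema adapter has to preserve.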

@comphead
Contributor Author

Not sure why "partitioned table is cached when partition pruning is true" fails: *** FAILED *** (1 second, 574 milliseconds) 👀
