@adriangb
Contributor

Summary

Extract benchmarks and sqllogictest cases from #19538 for easier review.

This PR includes:

  • New Benchmark: parquet_struct_query.rs - Benchmarks SQL queries on struct columns in Parquet files

    • 524,288 rows across 8 row groups
    • 20 benchmark queries covering struct access, filtering, joins, and aggregations
    • Struct schema: id (Int32) and s (Struct with id/Int32 and value/Utf8 fields)
  • SQLLogicTest: projection_pushdown.slt - Tests for projection pushdown optimization
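
The data layout quoted above (524,288 rows across 8 row groups) implies 65,536 rows per row group if the rows are split evenly. The PR description does not state how rows are distributed, so the even split is an assumption, and `row_group_ranges` below is a hypothetical helper for illustration only, not code from the benchmark:

```rust
/// Hypothetical helper (not part of the PR): split `total_rows` evenly into
/// `groups` row groups and return the half-open row range [start, end) of each.
/// ASSUMPTION: rows are distributed evenly; the PR only gives the two totals.
fn row_group_ranges(total_rows: usize, groups: usize) -> Vec<(usize, usize)> {
    assert!(
        groups > 0 && total_rows % groups == 0,
        "total_rows must divide evenly into groups"
    );
    let per_group = total_rows / groups;
    (0..groups)
        .map(|g| (g * per_group, (g + 1) * per_group))
        .collect()
}

fn main() {
    // Figures taken from the PR description: 524,288 rows, 8 row groups.
    let ranges = row_group_ranges(524_288, 8);
    for (i, (start, end)) in ranges.iter().enumerate() {
        println!("row group {i}: rows {start}..{end}");
    }
}
```

Under that assumption, each query in the benchmark has 8 equally sized row groups to scan, which keeps per-group work comparable between runs.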

Changes

  • Added datafusion/core/benches/parquet_struct_query.rs
  • Updated datafusion/core/Cargo.toml with benchmark entry
  • Added datafusion/sqllogictest/test_files/projection_pushdown.slt

Test Plan

  • Run benchmark: cargo bench --profile dev --bench parquet_struct_query
  • All 20 benchmark queries execute successfully
  • Parquet file generated with correct row count (524,288) and row groups (8)

🤖 Generated with Claude Code

@github-actions github-actions bot added the core (Core DataFusion crate) and sqllogictest (SQL Logic Tests (.slt)) labels Jan 23, 2026
@adriangb adriangb requested review from alamb and Copilot January 23, 2026 23:15
Contributor

Copilot AI left a comment

Pull request overview

This PR extracts benchmarks and sqllogictest cases from PR #19538 for easier review, focusing on testing struct field access projection pushdown optimization in DataFusion.

Changes:

  • Added comprehensive benchmark suite for SQL queries on struct columns in Parquet files with 20 different query patterns
  • Added 1000+ line SQLLogicTest file covering projection pushdown behavior with get_field expressions through various operators
  • Updated Cargo.toml to register the new benchmark

Reviewed changes

Copilot reviewed 3 out of 3 changed files in this pull request and generated no comments.

File | Description
datafusion/core/benches/parquet_struct_query.rs | New benchmark file testing struct field queries on Parquet data with various SQL patterns (filters, joins, aggregations, etc.)
datafusion/core/Cargo.toml | Added benchmark entry for parquet_struct_query with parquet feature requirement
datafusion/sqllogictest/test_files/projection_pushdown.slt | Comprehensive test suite for get_field projection pushdown through Filter, Sort, TopK, and multi-partition scenarios

@adriangb adriangb changed the title from "Add parquet struct query benchmark and projection pushdown tests" to "Add struct pushdown query benchmark and projection pushdown tests" Jan 23, 2026
adriangb and others added 2 commits January 23, 2026 18:53
Extract benchmarks and sqllogictest cases from apache#19538 for easier review.
Includes a new benchmark for SQL queries on struct columns in Parquet files,
covering struct access, filtering, joins, and aggregations with 524K rows
and 8 row groups.

Co-Authored-By: Claude Haiku 4.5 <noreply@anthropic.com>
Contributor

@alamb alamb left a comment

Makes sense to me -- thank you @adriangb

logical_plan
01)Projection: simple_struct.id, get_field(simple_struct.s, Utf8("value"))
02)--TableScan: simple_struct projection=[id, s]
physical_plan DataSourceExec: file_groups={1 group: [[WORKSPACE_ROOT/datafusion/sqllogictest/test_files/scratch/projection_pushdown/simple.parquet]]}, projection=[id, get_field(s@1, value) as simple_struct.s[value]], file_type=parquet
Contributor

It is interesting that these expressions have already been pushed down to the datasource

Contributor Author

Yep, in some cases (no sort, no repartition, etc.) it already works, but only because all projections are pushed down.

[[bench]]
harness = false
name = "parquet_query_sql"
required-features = ["parquet"]
Contributor

Is there any reason not to just add the benchmarks to parquet_query_sql?

Contributor Author

I could, but it's kind of nice to be able to run them in isolation easily, at least for now while we're developing just these. And in some sense the feature we're working on needn't be Parquet-specific (e.g. Vortex). We can always fold them in later.

@adriangb adriangb added this pull request to the merge queue Jan 24, 2026
@adriangb
Contributor Author

Thanks @alamb !

Merged via the queue into apache:main with commit 23f5003 Jan 24, 2026
31 checks passed

Labels

core (Core DataFusion crate), sqllogictest (SQL Logic Tests (.slt))


2 participants