@jkylling commented Feb 3, 2026

It would be useful to expose the virtual columns of the arrow Parquet reader, added in apache/arrow-rs#8715, in the datasource-parquet `ParquetSource`. Then engines can use both DataFusion's partition value machinery and the virtual columns. I had a go at it in this PR, but hit some rough edges. This is closer to an issue than a PR, but it is easier to explain with code.

The virtual columns we added are a bit difficult to integrate cleanly today. They are part of the physical schema of the Parquet reader, but cannot currently be projected. We need additional handling to avoid predicate pushdown for virtual columns, to build the correct projection mask, and to build the correct stream schema. See the changes to `opener.rs` in this PR, and the sketch below.
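
For illustration, a minimal sketch of the kind of special-casing this requires, assuming virtual columns can be recognized by their extension type name (the name used here is a placeholder, not the one actually registered by arrow-rs):

```rust
use std::sync::Arc;
use arrow_schema::{Field, Schema};

/// Illustrative only: split a file schema into physical and virtual fields
/// before building the projection mask and deciding which predicates can be
/// pushed down. "arrow.virtual.row_number" is a placeholder extension name.
fn split_virtual(schema: &Schema) -> (Vec<Arc<Field>>, Vec<Arc<Field>>) {
    schema
        .fields()
        .iter()
        .cloned()
        .partition(|f| f.extension_type_name() != Some("arrow.virtual.row_number"))
}
```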

One alternative would be to modify the arrow-rs implementation to remove these workarounds. Then the only change to `opener.rs` would be `.with_virtual_columns(virtual_columns.to_vec())?` (and maybe even that could be avoided? See the discussion below).

What would be the best way forward here?

Related to #20132

Aside on `.with_virtual_columns`

It is redundant that the user must both specify `Field::new("row_index", DataType::Int64, false).with_extension_type(RowNumber)` and add the column in a special way to the reader options with `.with_virtual_columns(virtual_columns.to_vec())?`. Once the extension type `RowNumber` is present, we already know the column is virtual.
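
A sketch of this double bookkeeping, assuming `RowNumber` (the virtual-column extension type from apache/arrow-rs#8715) is in scope; its import path is omitted because it depends on the arrow-rs version, and the reader-options line is shown as a comment since its receiver type is not pinned down here:

```rust
use arrow_schema::{DataType, Field};

fn example() -> Field {
    // 1. The field already carries the extension type in the schema...
    let row_index = Field::new("row_index", DataType::Int64, false)
        .with_extension_type(RowNumber);
    // 2. ...yet the same column must also be registered with the reader:
    //    options.with_virtual_columns(virtual_columns.to_vec())?
    // Step 2 is redundant: the extension type alone marks the column as virtual.
    row_index
}
```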

All users of `TableSchema`/`ParquetSource` must know that a schema is built out of three parts: the physical Parquet columns, the virtual columns, and the partition columns. From the user's perspective, one would just like to supply a single schema.

One alternative is to indicate the column kind only with extension types, so the user supplies just a schema. That is, an extension type would mark a column as a partition column or a virtual column, instead of the user supplying this information piecemeal. This may have a performance impact, as we would likely need to extract the differently tagged columns during planning, which could be problematic for large schemas.
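
To make the idea concrete, a hypothetical sketch: `RowNumber` exists today (apache/arrow-rs#8715, import path omitted), while `PartitionColumn` is invented here purely for illustration:

```rust
use arrow_schema::{DataType, Field, Schema};

fn example_schema() -> Schema {
    Schema::new(vec![
        // ordinary physical Parquet column
        Field::new("value", DataType::Utf8, true),
        // virtual column, marked by its extension type
        Field::new("row_index", DataType::Int64, false).with_extension_type(RowNumber),
        // a partition column would carry its own (currently nonexistent) marker:
        // Field::new("date", DataType::Utf8, false).with_extension_type(PartitionColumn),
    ])
}
```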

@jkylling (Author) commented on the diff:

```rust
    assert_eq!(row_index_values, vec![2, 3, 4]);
}

// Test 2: Filter on virtual column does not have predicate pushdown
```

No filtering on virtual columns in the Parquet source.

@jkylling commented Feb 3, 2026

@alamb @vustef @scovich It would be great to get your input on this!

@alamb (Contributor) commented Feb 3, 2026

What I recommend is that we first figure out what the high-level API will look like (i.e. how someone will query this via SQL and/or the dataframe API).

Then we can expose the relevant APIs in the parquet reader and other datasources as appropriate.

See also the discussion on
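
As a strawman for that discussion, one shape the high-level API could take is sketched below; this is purely illustrative (nothing is decided, and a `row_index` column would not resolve today), reusing DataFusion's existing SQL entry points:

```rust
use datafusion::error::Result;
use datafusion::prelude::*;

#[tokio::main]
async fn main() -> Result<()> {
    let ctx = SessionContext::new();
    ctx.register_parquet("t", "data/", ParquetReadOptions::default())
        .await?;

    // Strawman: the virtual column simply shows up as a queryable column.
    let df = ctx.sql("SELECT row_index, value FROM t").await?;
    df.show().await?;
    Ok(())
}
```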
