Hooks to re-evaluate table level filters/indexes during file scans #17954

@adriangb

Description

Consider the scenario of:

SELECT *
FROM large_table
JOIN small_table ON large_table.id = small_table.id
WHERE small_table.name = 'Adrian';

As per our recent blog post, we will first scan small_table, find the id for 'Adrian', and then scan large_table with that information available. But what if we had an external table-level point lookup index for large_table.id? As things stand, we won't be able to use it during the scan.

One option is to add hooks to the parquet readers that get called before each scan, something like:

trait ScanPlanUpdater {
    /// Called for a file before it is scanned; returns a possibly narrowed plan for that file.
    async fn rescan(&self, file: PartitionedFile, plan: FileScanPlan) -> Result<FileScanPlan>;
}

Then we call this before doing any more work on the file, which gives the hook a chance to consult the point lookup index. The main issue with this option is that it could result in many more lookups into the point lookup index than if the lookup were done once at the table level. Maybe implementations of ScanPlanUpdater could keep some sort of cache? I don't see a way to do it at the table level: the concept of a table is long gone by this point, and I can't think of a low-friction way to apply a filter to an entire DataSourceExec.
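To make the caching idea concrete, here is a rough sketch of what a caching ScanPlanUpdater might look like. Everything in it is an assumption for illustration: PartitionedFile and FileScanPlan are reduced to hypothetical stand-ins with a single field each, PointLookupIndex, CachingUpdater, and CountingIndex are made-up names, and the trait is synchronous so the sketch only needs std (the real hook would presumably be async). The point is only that each id is looked up once, however many files get rescanned.

```rust
use std::collections::HashMap;
use std::sync::atomic::{AtomicUsize, Ordering};
use std::sync::Mutex;

// Hypothetical stand-ins for the real types; only the fields this sketch needs.
#[derive(Clone, Debug, PartialEq)]
pub struct PartitionedFile {
    pub path: String,
}

#[derive(Clone, Debug, PartialEq)]
pub struct FileScanPlan {
    pub skip_file: bool,
}

// Synchronous variant of the trait proposed above, so the sketch needs only std.
pub trait ScanPlanUpdater {
    fn rescan(&self, file: &PartitionedFile, plan: FileScanPlan) -> Result<FileScanPlan, String>;
}

// Hypothetical external point lookup index: maps an id to the files containing it.
pub trait PointLookupIndex {
    fn files_for(&self, id: i64) -> Vec<String>;
}

// Updater that memoizes index answers, so each id is looked up once
// no matter how many files are scanned.
pub struct CachingUpdater<I: PointLookupIndex> {
    index: I,
    ids: Vec<i64>, // join keys found while scanning the small side
    cache: Mutex<HashMap<i64, Vec<String>>>,
}

impl<I: PointLookupIndex> CachingUpdater<I> {
    pub fn new(index: I, ids: Vec<i64>) -> Self {
        Self { index, ids, cache: Mutex::new(HashMap::new()) }
    }
}

impl<I: PointLookupIndex> ScanPlanUpdater for CachingUpdater<I> {
    fn rescan(&self, file: &PartitionedFile, mut plan: FileScanPlan) -> Result<FileScanPlan, String> {
        let mut cache = self.cache.lock().map_err(|e| e.to_string())?;
        let mut matched = false;
        for id in &self.ids {
            // Only hit the external index on a cache miss.
            let files = cache.entry(*id).or_insert_with(|| self.index.files_for(*id));
            if files.iter().any(|f| f == &file.path) {
                matched = true;
                break;
            }
        }
        // Skip the file when the index says none of the ids live in it.
        plan.skip_file = !matched;
        Ok(plan)
    }
}

// Toy index that counts how often it is consulted.
pub struct CountingIndex {
    pub lookups: AtomicUsize,
}

impl PointLookupIndex for CountingIndex {
    fn files_for(&self, id: i64) -> Vec<String> {
        self.lookups.fetch_add(1, Ordering::SeqCst);
        if id == 42 { vec!["b.parquet".to_string()] } else { Vec::new() }
    }
}

// Runs the updater over three files; returns the skip decisions and the lookup count.
pub fn demo() -> (Vec<bool>, usize) {
    let updater = CachingUpdater::new(CountingIndex { lookups: AtomicUsize::new(0) }, vec![42]);
    let skips: Vec<bool> = ["a.parquet", "b.parquet", "c.parquet"]
        .iter()
        .map(|p| {
            let file = PartitionedFile { path: p.to_string() };
            updater.rescan(&file, FileScanPlan { skip_file: false }).unwrap().skip_file
        })
        .collect();
    let lookups = updater.index.lookups.load(Ordering::SeqCst);
    (skips, lookups)
}
```

In this toy setup, demo() rescans three files but consults the index only once. A cache like this still pays the per-file hook call, so it softens the extra-lookup cost rather than recovering a true table-level lookup.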
