Description
Consider the scenario of:
SELECT *
FROM large_table
JOIN small_table ON large_table.id = small_table.id
WHERE small_table.name = 'Adrian';

As per our recent blog post, we will first scan small_table, find the id for 'Adrian', and then scan large_table with that information available. But what if we had an external table-level point lookup index for large_table.id? We won't be able to use that during the scan.
One option is to add hooks to the parquet readers that get called before each scan, something like:
trait ScanPlanUpdater {
    async fn rescan(&self, file: PartitionedFile, plan: FileScanPlan) -> Result<FileScanPlan>;
}

Then we call this before doing any more work on the file, which gives the implementation a chance to check the point lookup index. The main issue with this option is that it could result in many more lookups into the point lookup index than if the lookup were done once at the table level. Maybe implementations of ScanPlanUpdater can have some sort of cache? I don't see a way to do it at the table level: the concept of a table is long gone by this point, and I can't think of a low-friction way to apply a filter to an entire DataSourceExec.
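
To make the caching idea concrete, here is a rough sketch of what such an implementation might look like. Everything in it is hypothetical: `PartitionedFile` and `FileScanPlan` are reduced to tiny stand-in structs (the plan is assumed to be just a list of row groups), `PointLookupIndex`, `CachingUpdater` and `CountingIndex` are made-up names, and `rescan` is synchronous and borrows the file so the example needs only the standard library. It also assumes the ids are already known when the first file is opened.

```rust
use std::collections::HashMap;
use std::sync::atomic::{AtomicUsize, Ordering};
use std::sync::OnceLock;

// Hypothetical stand-ins: the real types carry far more information.
#[derive(Clone, Debug, PartialEq)]
pub struct PartitionedFile {
    pub path: String,
}

#[derive(Clone, Debug, PartialEq)]
pub struct FileScanPlan {
    /// Row groups still to be read from the file (an assumption of this sketch).
    pub row_groups: Vec<usize>,
}

/// Hypothetical external index: maps ids to the row groups, per file path,
/// that can contain them.
pub trait PointLookupIndex {
    fn lookup(&self, ids: &[i64]) -> HashMap<String, Vec<usize>>;
}

/// Synchronous simplification of the hook proposed above.
pub trait ScanPlanUpdater {
    fn rescan(&self, file: &PartitionedFile, plan: FileScanPlan) -> Result<FileScanPlan, String>;
}

/// Queries the index once and reuses the answer for every file.
pub struct CachingUpdater<I: PointLookupIndex> {
    pub index: I,
    pub ids: Vec<i64>,
    cache: OnceLock<HashMap<String, Vec<usize>>>,
}

impl<I: PointLookupIndex> CachingUpdater<I> {
    pub fn new(index: I, ids: Vec<i64>) -> Self {
        Self { index, ids, cache: OnceLock::new() }
    }
}

impl<I: PointLookupIndex> ScanPlanUpdater for CachingUpdater<I> {
    fn rescan(&self, file: &PartitionedFile, mut plan: FileScanPlan) -> Result<FileScanPlan, String> {
        // The first call performs the single index lookup; later calls hit the cache.
        let hits = self.cache.get_or_init(|| self.index.lookup(&self.ids));
        match hits.get(&file.path) {
            // Keep only the row groups the index says can match.
            Some(groups) => plan.row_groups.retain(|g| groups.contains(g)),
            // The index has no entry for this file, so nothing needs to be read.
            None => plan.row_groups.clear(),
        }
        Ok(plan)
    }
}

/// In-memory fake index that counts how often it is queried.
pub struct CountingIndex {
    pub entries: HashMap<i64, (String, usize)>,
    pub lookups: AtomicUsize,
}

impl PointLookupIndex for CountingIndex {
    fn lookup(&self, ids: &[i64]) -> HashMap<String, Vec<usize>> {
        self.lookups.fetch_add(1, Ordering::SeqCst);
        let mut out: HashMap<String, Vec<usize>> = HashMap::new();
        for id in ids {
            if let Some((path, group)) = self.entries.get(id) {
                out.entry(path.clone()).or_default().push(*group);
            }
        }
        out
    }
}
```

With this shape the index is queried once no matter how many files are scanned, which is roughly what a table-level lookup would give. The open question is still where the cache lives and how it gets invalidated if the dynamic filter changes after the first file has been opened; this sketch does not address that.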