Skip to content

Late materialization when LIMIT prunes heavily. #23263

Description

@RatulDawar

Currently ClickBench Q23 is a major outliner due to the fact that datafusion doesn't do late materialization of projected columns.

SELECT * FROM hits WHERE "URL" LIKE '%google%' ORDER BY "EventTime" LIMIT 10;

A query like this needs to scan all the columns only after all the redundant rows are prunned from output.

Some numbers that I was able to get locally showing this regression in play.

Engine Query time
DataFusion 54 (SELECT *) ~4.0s avg
DataFusion 54 (SELECT "EventTime", "URL") ~0.9s
SortPreservingMergeExec (limit 10)
└─ SortExec TopK (EventTime), DynamicFilter [ empty ]
   └─ ProjectionExec (105 columns)
      └─ FilterExec (URL LIKE '%google%')
         └─ DataSourceExec (projection=[105 cols], 111 files)
              predicate: URL LIKE %google% AND DynamicFilter [ empty ]

EXPLAIN ANALYZE metrics (SELECT *)

DataSourceExec

Metric Value
projection 105 columns
output_rows 97.17 M
output_bytes 137.8 GB
row_groups_pruned_statistics 325 → 325 matched
row_groups_pruned_bloom_filter 325 → 325 matched
page_index_pages_pruned 0

What duckdb does in this scenario is that it materialises the projection rows that are not needed for any computation or maybe needed in later steps as late as possilble.

But this late materialization comes at the cost of requiring a self join later on to get the other columns back, and got self join's we need row tagging.

In datafusion we can tag a row uniquely with file_row_index(), input_file_name().
But then if rows are not dynamically filtered on the basis of above two columns it will again lead to a full table scan for probe side of the join.

Metadata

Metadata

Assignees

No one assigned

    Labels

    No labels
    No labels

    Type

    No type

    Fields

    No fields configured for issues without a type.

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions