Late materialization when LIMIT prunes heavily.

Currently ClickBench Q23 is a major outliner due to the fact that datafusion doesn't do late materialization of projected columns. 

```
SELECT * FROM hits WHERE "URL" LIKE '%google%' ORDER BY "EventTime" LIMIT 10;
```
A query like this needs to scan all the columns only after all the redundant rows are prunned from output.  

Some numbers that I was able to get locally showing this regression in play. 

| Engine | Query time |
|--------|------------|
| DataFusion 54 (`SELECT *`) | ~4.0s avg |
| DataFusion 54 (`SELECT "EventTime", "URL"`) | ~0.9s |


```text
SortPreservingMergeExec (limit 10)
└─ SortExec TopK (EventTime), DynamicFilter [ empty ]
   └─ ProjectionExec (105 columns)
      └─ FilterExec (URL LIKE '%google%')
         └─ DataSourceExec (projection=[105 cols], 111 files)
              predicate: URL LIKE %google% AND DynamicFilter [ empty ]
```

### `EXPLAIN ANALYZE` metrics (`SELECT *`)

**DataSourceExec**

| Metric | Value |
|--------|-------|
| `projection` | 105 columns |
| `output_rows` | 97.17 M |
| `output_bytes` | 137.8 GB |
| `row_groups_pruned_statistics` | 325 → 325 matched |
| `row_groups_pruned_bloom_filter` | 325 → 325 matched |
| `page_index_pages_pruned` | 0 |



What duckdb does in this scenario is that it materialises the projection rows that are not needed for any computation or maybe needed in later steps as late as possilble. 

But this late materialization comes at the cost of requiring a self join later on to get the other columns back, and got self join's we need row tagging. 

In datafusion we can tag a row uniquely with `file_row_index()`, `input_file_name()`. 
But then if rows are not dynamically filtered on the basis of above two columns it will again lead to a full table scan for probe side of the join. 







Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

Late materialization when LIMIT prunes heavily. #23263

`EXPLAIN ANALYZE` metrics (`SELECT *`)

Metadata

Assignees

Labels

Type

Fields

Projects

Milestone

Relationships

Development

Engine	Query time
DataFusion 54 (`SELECT *`)	~4.0s avg
DataFusion 54 (`SELECT "EventTime", "URL"`)	~0.9s

Metric	Value
`projection`	105 columns
`output_rows`	97.17 M
`output_bytes`	137.8 GB
`row_groups_pruned_statistics`	325 → 325 matched
`row_groups_pruned_bloom_filter`	325 → 325 matched
`page_index_pages_pruned`	0

Uh oh!

Late materialization when LIMIT prunes heavily. #23263

Description

EXPLAIN ANALYZE metrics (SELECT *)

Metadata

Metadata

Assignees

Labels

Type

Fields

Projects

Milestone

Relationships

Development

Issue actions

`EXPLAIN ANALYZE` metrics (`SELECT *`)