Late Materialization Optimizer #15692

Merged
21 commits merged into duckdb:main on Jan 14, 2025

Conversation

Mytherin
Collaborator

This PR adds the Late Materialization optimizer, which enables late materialization for certain queries - in particular top-n (ORDER BY .. LIMIT ..), limit + offset, and sample queries. The optimization piggybacks on the row-id filter pushdown introduced in #15020 and performs the row-id rewrites mentioned there automatically.

Rewrites

Here are some examples of rewrites:

Top-N
```sql
SELECT * FROM lineitem ORDER BY l_extendedprice DESC LIMIT 5;

-- becomes

SELECT * FROM lineitem WHERE rowid IN (
    SELECT rowid FROM lineitem ORDER BY l_extendedprice DESC LIMIT 5)
ORDER BY l_extendedprice DESC;
```

Limit + Offset

```sql
SELECT * FROM lineitem LIMIT 1000000 OFFSET 10000000;

-- becomes

SELECT * FROM lineitem WHERE rowid IN (
    SELECT rowid FROM lineitem LIMIT 1000000 OFFSET 10000000)
ORDER BY rowid;
```
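
Sample

The PR description does not spell out the sample rewrite; as a sketch, the analogous rewrite would presumably look along these lines:

```sql
SELECT * FROM lineitem USING SAMPLE 5 ROWS;

-- presumably becomes

SELECT * FROM lineitem WHERE rowid IN (
    SELECT rowid FROM lineitem USING SAMPLE 5 ROWS);
```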

Performance

```sql
SELECT * FROM lineitem ORDER BY l_extendedprice DESC LIMIT 5;
```

| v1.1.3 | main  |  new  |
|--------|-------|-------|
| 2.1s   | 0.07s | 0.01s |

```sql
SELECT * FROM lineitem ORDER BY l_orderkey DESC LIMIT 5;
```

| v1.1.3 | main |  new  |
|--------|------|-------|
| 10.2s  | 2.8s | 0.14s |

```sql
SELECT * FROM lineitem LIMIT 1000000 OFFSET 10000000;
```

| v1.1.3 | main  |  new  |
|--------|-------|-------|
| 0.46s  | 0.46s | 0.07s |

@duckdb-draftbot duckdb-draftbot marked this pull request as draft January 13, 2025 16:48
@Mytherin Mytherin marked this pull request as ready for review January 13, 2025 17:09
@abramk
Contributor

abramk commented Jan 13, 2025

@Mytherin I tried your PR against the following and am not seeing a speed-up.

```sql
create table tbl1(pk LONG PRIMARY KEY,
       col_long_text VARCHAR,
       col_long_text_2 VARCHAR,
       col_truncated VARCHAR,
       col_sort LONG);

INSERT INTO tbl1 SELECT
       i AS pk,
       rpad('', 64000, 'A') AS col_long_text,
       rpad('', 64000, 'B') AS col_long_text_2,
       rpad('', 4000, 'A') AS col_truncated,
       floor(random() * 9223372036854775807)::long AS col_sort FROM generate_series(1, 50000) s(i);
```

I then run this query:

```sql
select * from tbl1 order by pk desc limit 1000;
```

and it takes 1.477s, but when I run the transformed version manually:

```sql
select * from tbl1 where rowid in (select rowid from tbl1 order by pk desc limit 1000);
```

it takes 155ms.

I think I have your PR compiled correctly, as `PRAGMA profiling_mode = 'detailed'` shows:

┌────────────────────────────────────────────────┐
│               Optimizer: 0.0002s               │
│┌──────────────────────────────────────────────┐│
││        Build Side Probe Side: 0.0000s        ││
││           Column Lifetime: 0.0000s           ││
││           Common Aggregate: 0.0000s          ││
││        Common Subexpressions: 0.0000s        ││
││      Compressed Materialization: 0.0000s     ││
││          Cte Filter Pusher: 0.0000s          ││
││             Deliminator: 0.0000s             ││
││           Duplicate Groups: 0.0000s          ││
││         Empty Result Pullup: 0.0000s         ││
││         Expression Rewriter: 0.0000s         ││
││              Extension: 0.0000s              ││
││            Filter Pullup: 0.0000s            ││
││           Filter Pushdown: 0.0000s           ││
││              In Clause: 0.0000s              ││
││         Join Filter Pushdown: 0.0000s        ││
││              Join Order: 0.0000s             ││
││         Late Materialization: 0.0000s        ││
││            Limit Pushdown: 0.0000s           ││
││           Materialized Cte: 0.0000s          ││
││             Regex Range: 0.0000s             ││
││            Reorder Filter: 0.0000s           ││
││          Sampling Pushdown: 0.0000s          ││
││        Statistics Propagation: 0.0000s       ││
││             Sum Rewriter: 0.0000s            ││
││                Top N: 0.0000s                ││
││           Unnest Rewriter: 0.0000s           ││
││            Unused Columns: 0.0000s           ││
│└──────────────────────────────────────────────┘│
└────────────────────────────────────────────────┘

You can see the Late Materialization entry in the list.

@Mytherin
Collaborator Author

That's because of the limit size. The late materialization optimizer only triggers when the limit is <= 50 at this point (see max_row_count in the code). I can make this configurable in a follow-up PR.
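
A quick way to check whether the rewrite kicks in for a given limit is to look at the plan. A sketch, assuming the `tbl1` schema from the comment above (the exact plan output differs between versions):

```sql
-- With the current threshold, LIMIT 50 should be rewritten while LIMIT 1000 should not.
EXPLAIN SELECT * FROM tbl1 ORDER BY pk DESC LIMIT 50;
EXPLAIN SELECT * FROM tbl1 ORDER BY pk DESC LIMIT 1000;
```

When the optimizer triggers, the plan should roughly show a Top-N over a rowid-only scan whose result is used to filter a second scan of `tbl1`; otherwise the Top-N sits directly on top of a full scan.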

@duckdb-draftbot duckdb-draftbot marked this pull request as draft January 13, 2025 18:20
@Mytherin Mytherin marked this pull request as ready for review January 13, 2025 18:20
@carlopi
Contributor

carlopi commented Jan 14, 2025

Minor, also to be done in a subsequent PR: there are 3 warnings turned into errors in the amalgamation run (OSX Debug):

```text
-----------------------
-- Writing src/amalgamation/duckdb.hpp --
-----------------------
------------------------
-- Writing src/amalgamation/duckdb.cpp --
------------------------
duckdb.cpp:253075:11: error: redundant move in return statement [-Werror,-Wredundant-move]
                        return std::move(op);
                               ^
duckdb.cpp:253075:11: note: remove std::move call here
                        return std::move(op);
                               ^~~~~~~~~~  ~
duckdb.cpp:253086:11: error: redundant move in return statement [-Werror,-Wredundant-move]
                        return std::move(op);
                               ^
duckdb.cpp:253086:11: note: remove std::move call here
                        return std::move(op);
                               ^~~~~~~~~~  ~
duckdb.cpp:253099:11: error: redundant move in return statement [-Werror,-Wredundant-move]
                        return std::move(op);
                               ^
duckdb.cpp:253099:11: note: remove std::move call here
                        return std::move(op);
                               ^~~~~~~~~~  ~
3 errors generated.
```

@Mytherin Mytherin merged commit 2a1d5e2 into duckdb:main Jan 14, 2025
49 checks passed
Mytherin added a commit that referenced this pull request Jan 14, 2025
…e the threshold at which late materialization is triggered (#15697)

Follow-up from #15692

This PR adds the `late_materialization_max_rows` setting that allows you
to configure the threshold at which late materialization is triggered.
The default value is `50`.

Example usage:

```sql
SET late_materialization_max_rows=1000;
explain SELECT * FROM lineitem ORDER BY l_orderkey DESC LIMIT 1000;
```

The exact best setting is hard to determine. Essentially, the row-id pushdown has two components: (1) the OR filter pushdown, which is done for up to `dynamic_or_filter_threshold` rows (defaults to 50), and (2) the min-max filter pushdown.

The row-id rewrite generally provides a performance improvement for up to `dynamic_or_filter_threshold` rows, which is why we select that as the default. Beyond that, it depends on the locality of the rows. If the min-max filter on row-ids is selective (i.e. the rows we select are physically close together in the table), the row-id rewrite is effective. If the rows are spread out, the rewrite can worsen performance (see the sketch after this commit message for an illustration).

CC @abramk
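
To illustrate the locality point, here is a sketch comparing the manual rewrite for two orderings of lineitem: ordering by `rowid` selects physically contiguous rows (a tight min-max range on row-ids), while ordering by a value column such as `l_extendedprice` typically selects rows scattered across the table (a wide min-max range), so the min-max filter prunes little:

```sql
-- Contiguous row-ids: the min-max filter on rowid is highly selective.
SELECT * FROM lineitem WHERE rowid IN (
    SELECT rowid FROM lineitem ORDER BY rowid LIMIT 1000);

-- Scattered row-ids: the selected rows can span the whole table,
-- so the min-max filter prunes little and the rewrite may not pay off.
SELECT * FROM lineitem WHERE rowid IN (
    SELECT rowid FROM lineitem ORDER BY l_extendedprice DESC LIMIT 1000);
```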
@Mytherin Mytherin deleted the latematerialization branch January 16, 2025 16:53
krlmlr added a commit to duckdb/duckdb-r that referenced this pull request Feb 2, 2025
Late Materialization Optimizer (duckdb/duckdb#15692)
Clean up temporary test directory in `run_tests_one_by_one.py` even if test segfaults (duckdb/duckdb#15688)
Reduce test size so CI is less likely to fail (duckdb/duckdb#15689)
@YuweiXiao

Hey! Wondering whether this optimization applies to non-DuckDB tables, e.g., a Delta Lake table. Thanks!

Mytherin added a commit that referenced this pull request May 2, 2025
…(*)` directly in the multi file reader (#17325)

This PR generalizes the late materialization optimizer introduced in
#15692 - allowing it to be used for
the Parquet reader.

In particular, the `TableFunction` is extended with an extra callback
that allows specifying the relevant row-id columns:

```cpp
typedef vector<column_t> (*table_function_get_row_id_columns)(ClientContext &context,
                                                              optional_ptr<FunctionData> bind_data);
```

This is then used by the Parquet reader to specify the two row-id
columns: `file_index` (#17144) and
`file_row_number` (#16979). Top-N, sample, and limit/offset queries are then transformed into a join on the
relevant row-id columns. For example:


```sql
SELECT * FROM lineitem.parquet ORDER BY l_extendedprice DESC LIMIT 5;

-- becomes

SELECT * FROM lineitem.parquet WHERE (file_index, file_row_number) IN (
    SELECT file_index, file_row_number FROM lineitem.parquet ORDER BY l_extendedprice DESC LIMIT 5)
ORDER BY l_extendedprice DESC;
```

### Performance

```sql
SELECT * FROM lineitem.parquet ORDER BY l_extendedprice DESC LIMIT 5;
```

| v1.2.1 |  main  |  new   |
|--------|--------|--------|
| 0.19s  | 0.14s | 0.06s |


```sql
SELECT * FROM lineitem.parquet ORDER BY l_orderkey DESC LIMIT 5;
```
| v1.2.1 | main  |  new  |
|--------|-------|-------|
| 0.73s  | 0.53s | 0.06s |

```sql
SELECT * FROM lineitem.parquet LIMIT 1000000 OFFSET 10000000;
```
| v1.2.1 | main  |  new  |
|--------|-------|-------|
| 1.6s   | 1.2s | 0.14s |


### Refactor

I've also moved `ParquetMultiFileInfo` to a separate file as part of this PR, which accounts for most of the changes here.