
Storage: Lazily Load Row Groups from Tables #6715


Merged (18 commits) on Mar 15, 2023

@Mytherin commented Mar 14, 2023

This PR changes database startup so that row groups are constructed lazily instead of eagerly. Eager row group construction is generally the largest contributor to the start-up cost of large databases, because the number of row groups grows with the size of the tables. The lazy construction is done in the SegmentTree structure: whenever a row group is requested from the segment tree, row groups are loaded as needed. This also means that not all row groups necessarily have to be loaded. For example, when running a LIMIT on a table, we only need to load the initial row groups.
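To illustrate why a LIMIT only touches the initial row groups, here is a minimal sketch, not the code from this PR. `LazyRowGroups`, `ScanWithLimit`, the loader callback and the `row_count` field are all hypothetical names, and the sketch assumes row groups can only be loaded in order.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <functional>
#include <memory>
#include <utility>
#include <vector>

// Hypothetical stand-in for a row group; the real class holds column data.
struct RowGroup {
    std::size_t row_count;
};

// Hypothetical container: row groups are materialized only on first access.
class LazyRowGroups {
public:
    using Loader = std::function<std::unique_ptr<RowGroup>(std::size_t)>;

    LazyRowGroups(std::size_t total, Loader loader)
        : total_(total), loader_(std::move(loader)) {
        assert(loader_ && "a loader callback is required");
    }

    // Returns row group `i`, loading it and any earlier ones if needed.
    // Returns nullptr when `i` is past the end.
    RowGroup *Get(std::size_t i) {
        if (i >= total_) {
            return nullptr;
        }
        while (loaded_.size() <= i) {
            loaded_.push_back(loader_(loaded_.size()));
        }
        return loaded_[i].get();
    }

    std::size_t LoadedCount() const { return loaded_.size(); }

private:
    std::size_t total_;
    Loader loader_;
    std::vector<std::unique_ptr<RowGroup>> loaded_;
};

// Scans row groups in order and stops once `limit` rows were produced,
// so later row groups are never requested (and therefore never loaded).
std::size_t ScanWithLimit(LazyRowGroups &groups, std::size_t limit) {
    std::size_t produced = 0;
    for (std::size_t i = 0; produced < limit; i++) {
        RowGroup *rg = groups.Get(i);
        if (!rg) {
            break;
        }
        produced += std::min(rg->row_count, limit - produced);
    }
    return produced;
}
```

In this sketch a scan with a limit of one row requests only the first row group, so the remaining ones stay unloaded until a later query asks for them.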

Below are some timings run on a database file containing a single large lineitem table at scale factor 120 (or rather, 120x SF1 lineitem). This is around 770 million rows, or 6K row groups.

| Query | v0.6.1 | v0.7.1 | master | new | Parquet |
|---|---|---|---|---|---|
| SELECT 42 | 1.60s | 0.31s | 0.21s | 0.02s | - |
| FROM lineitem LIMIT 1; | 1.62s | 0.32s | 0.22s | 0.03s | 0.27s |

Segment Tree Restructuring

This PR restructures the segment tree so that it is fully templated, and also adds templating to the SegmentBase. This avoids the need for casting to and from the SegmentBase class everywhere. RowGroup now inherits from SegmentBase<RowGroup> and there is a RowGroupSegmentTree that inherits from SegmentTree<RowGroup>.
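The templating described above could look roughly like the following simplified sketch. The class names come from the description, but the members (`start`, `count`, `AppendSegment`, `nodes_`) are assumptions made for illustration and do not mirror the actual DuckDB headers.

```cpp
#include <cassert>
#include <cstddef>
#include <memory>
#include <utility>
#include <vector>

// Templated base: `next` already has the concrete type T,
// so callers never cast from a generic base class.
template <class T>
struct SegmentBase {
    SegmentBase(std::size_t start, std::size_t count) : start(start), count(count) {}

    std::size_t start;  // first row covered by this segment
    std::size_t count;  // number of rows in this segment
    T *next = nullptr;  // following segment, or nullptr at the end
};

template <class T>
class SegmentTree {
public:
    // Appends a segment and links it to the previous one.
    void AppendSegment(std::unique_ptr<T> segment) {
        assert(segment);
        if (!nodes_.empty()) {
            nodes_.back()->next = segment.get();
        }
        nodes_.push_back(std::move(segment));
    }

    T *GetRootSegment() { return nodes_.empty() ? nullptr : nodes_.front().get(); }

private:
    std::vector<std::unique_ptr<T>> nodes_;
};

// The row group passes itself as the template argument.
struct RowGroup : SegmentBase<RowGroup> {
    using SegmentBase<RowGroup>::SegmentBase;
};

class RowGroupSegmentTree : public SegmentTree<RowGroup> {};
```

With this shape, `tree.GetRootSegment()` and `segment->next` both yield a `RowGroup *` directly, which is what removes the casts.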

Rather than directly accessing the next pointer of the SegmentBase, callers now always go through SegmentTree::GetNextSegment. This will usually still follow the next pointer (as that is the most efficient), but it might trigger the lazy loading of segments if necessary.

In addition, other methods (like GetRootSegment, GetSegmentByIndex, etc.) are also overloaded to correctly trigger lazy loading of row groups if required.
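A hedged sketch of how the accessors might trigger loading is shown below. The method names GetNextSegment, GetRootSegment and GetSegmentByIndex are taken from the description; the `LoadSegment` hook, the `loaded` counter and the internal layout are assumptions for illustration only.

```cpp
#include <cassert>
#include <cstddef>
#include <memory>
#include <utility>
#include <vector>

template <class T>
struct SegmentBase {
    T *next = nullptr;
};

template <class T>
class SegmentTree {
public:
    virtual ~SegmentTree() = default;

    T *GetRootSegment() { return GetSegmentByIndex(0); }

    // Loads segments until `index` exists, or returns nullptr if it never will.
    T *GetSegmentByIndex(std::size_t index) {
        while (nodes_.size() <= index) {
            if (!LoadNextSegment()) {
                return nullptr;
            }
        }
        return nodes_[index].get();
    }

    T *GetNextSegment(T *segment) {
        assert(segment);
        if (segment->next) {
            return segment->next;  // fast path: follow the pointer
        }
        // No successor linked yet: `segment` is the last loaded one, so try to load more.
        return LoadNextSegment() ? segment->next : nullptr;
    }

protected:
    // Hypothetical hook: returns the next segment, or nullptr when none remain.
    virtual std::unique_ptr<T> LoadSegment() { return nullptr; }

private:
    bool LoadNextSegment() {
        std::unique_ptr<T> segment = LoadSegment();
        if (!segment) {
            return false;
        }
        if (!nodes_.empty()) {
            nodes_.back()->next = segment.get();
        }
        nodes_.push_back(std::move(segment));
        return true;
    }

    std::vector<std::unique_ptr<T>> nodes_;
};

struct RowGroup : SegmentBase<RowGroup> {
    std::size_t index = 0;
};

class RowGroupSegmentTree : public SegmentTree<RowGroup> {
public:
    explicit RowGroupSegmentTree(std::size_t total) : total_(total) {}

    std::size_t loaded = 0;  // how many row groups were materialized so far

protected:
    // Stands in for reading row group metadata from the database file.
    std::unique_ptr<RowGroup> LoadSegment() override {
        if (loaded >= total_) {
            return nullptr;
        }
        auto row_group = std::make_unique<RowGroup>();
        row_group->index = loaded++;
        return row_group;
    }

private:
    std::size_t total_;
};
```

Routing every access through the tree keeps the loading logic in one place, at the cost of one extra branch on the common path where the next pointer is already set.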

Zone Maps Optimization

This PR also adds optimizations around the pushdown of zonemaps to avoid value construction in the NumericStats. This significantly speeds up queries whose runtime is dominated by zonemap skipping, which can happen when we select a small subset of data from a large table with many row groups.
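As a rough illustration of the idea, a zonemap check can compare the filter constant against the stored min/max as plain numbers, without boxing them into generic value objects first. Everything in this sketch (`ZonemapResult`, `Comparison`, `CheckZonemap`, and this layout of `NumericStats`) is hypothetical and not DuckDB's actual API; it also assumes a 64-bit integer column without NULLs.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical result of checking a filter against a row group's min/max.
enum class ZonemapResult { AlwaysFalse, AlwaysTrue, MustScan };
enum class Comparison { Equal, GreaterThan, LessThan };

// Min/max kept as plain integers, so a check needs no boxed value objects.
struct NumericStats {
    std::int64_t min;
    std::int64_t max;
};

// Evaluates `column <op> constant` against the statistics of one row group.
ZonemapResult CheckZonemap(const NumericStats &stats, Comparison op, std::int64_t constant) {
    assert(stats.min <= stats.max);
    switch (op) {
    case Comparison::Equal:
        if (constant < stats.min || constant > stats.max) {
            return ZonemapResult::AlwaysFalse;  // constant lies outside [min, max]
        }
        if (stats.min == stats.max) {
            return ZonemapResult::AlwaysTrue;   // every value equals the constant
        }
        return ZonemapResult::MustScan;
    case Comparison::GreaterThan:
        if (stats.max <= constant) {
            return ZonemapResult::AlwaysFalse;
        }
        if (stats.min > constant) {
            return ZonemapResult::AlwaysTrue;
        }
        return ZonemapResult::MustScan;
    case Comparison::LessThan:
        if (stats.min >= constant) {
            return ZonemapResult::AlwaysFalse;
        }
        if (stats.max < constant) {
            return ZonemapResult::AlwaysTrue;
        }
        return ZonemapResult::MustScan;
    }
    return ZonemapResult::MustScan;
}
```

When a table has thousands of row groups and most of them are skipped, a check like this runs once per row group, so keeping it free of allocations and generic value handling matters.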

@Mytherin merged commit 5a409f9 into duckdb:master on Mar 15, 2023
@Mytherin deleted the lazyloadmetadata branch on April 24, 2023 13:34