Storage: Lazily Load Row Groups from Tables #6715
Merged
This PR modifies the initial startup of a database so that row groups are constructed lazily instead of eagerly. Row group construction is generally the largest contributor to the start-up cost of large databases, as the number of row groups grows with the size of the tables.

The lazy row group construction is done in the `SegmentTree` structure. Whenever a row group is requested from the segment tree, row groups are loaded as needed. This also means that not all row groups necessarily have to be loaded. For example, when running a `LIMIT` on a table, we only need to load the initial row groups.

Below are some timings run on a database file containing a single large `lineitem` table at scale factor 120 (or rather, 120x SF1 lineitem). This is around 770 million rows, or 6K row groups.

Segment Tree Restructuring
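As a rough illustration of the lazy loading idea, the C++ sketch below loads row groups only when a caller asks for them. It is not the implementation from this PR: `LazyRowGroupTree`, the `Loader` callback, and `LoadedCount` are hypothetical names introduced here, the `RowGroup` struct is a bare stand-in, and the sketch assumes single-threaded access. Only the method names `GetRootSegment`, `GetSegmentByIndex`, and `GetNextSegment` are taken from the description.

```cpp
#include <cassert>
#include <cstddef>
#include <functional>
#include <memory>
#include <utility>
#include <vector>

// Hypothetical stand-in for a row group; the real class holds column data.
struct RowGroup {
    std::size_t index;
};

// Hypothetical lazily loaded collection of row groups (single-threaded sketch).
// `Loader` stands in for deserializing one row group from storage; it returns
// nullptr once there are no more row groups to load.
class LazyRowGroupTree {
public:
    using Loader = std::function<std::unique_ptr<RowGroup>(std::size_t)>;

    explicit LazyRowGroupTree(Loader loader) : loader_(std::move(loader)) {}

    // Returns the row group at `index`, loading only as many row groups as
    // are needed to reach it. Returns nullptr if the index is out of range.
    RowGroup *GetSegmentByIndex(std::size_t index) {
        while (!finished_ && nodes_.size() <= index) {
            LoadNext();
        }
        return index < nodes_.size() ? nodes_[index].get() : nullptr;
    }

    RowGroup *GetRootSegment() { return GetSegmentByIndex(0); }

    // Returns the row group after `current`, loading it on demand.
    RowGroup *GetNextSegment(const RowGroup &current) {
        return GetSegmentByIndex(current.index + 1);
    }

    // Number of row groups materialized so far.
    std::size_t LoadedCount() const { return nodes_.size(); }

private:
    void LoadNext() {
        std::unique_ptr<RowGroup> next = loader_(nodes_.size());
        if (!next) {
            finished_ = true;  // storage exhausted, stop asking the loader
            return;
        }
        assert(next->index == nodes_.size());
        nodes_.push_back(std::move(next));
    }

    Loader loader_;
    std::vector<std::unique_ptr<RowGroup>> nodes_;
    bool finished_ = false;
};
```

With a loader backed by thousands of row groups, calling `GetRootSegment` followed by a single `GetNextSegment` materializes only two of them, which is the effect a `LIMIT` query benefits from.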
This PR restructures the segment tree so that it is fully templated, and also adds templating to `SegmentBase`. This avoids the need to cast to and from the `SegmentBase` class everywhere. `RowGroup` now inherits from `SegmentBase<RowGroup>`, and there is a `RowGroupSegmentTree` that inherits from `SegmentTree<RowGroup>`.

Rather than directly accessing the next pointer of the `SegmentBase`, `SegmentTree::GetNextSegment` is now always called. This will usually still follow the next pointer (as that is the most efficient), but might trigger the lazy loading of segments if necessary.

In addition, other methods (like `GetRootSegment`, `GetSegmentByIndex`, etc.) are also overloaded to correctly trigger lazy loading of row groups if required.

Zone Maps Optimization
This PR also adds optimizations around the pushdown of zonemaps to avoid value construction in the `NumericStats`. This significantly speeds up queries whose run time is dominated by zonemap skipping, which can happen when selecting a small subset of data from a large table with many row groups.
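To make the idea of a zonemap check without value construction concrete, here is a minimal C++ sketch that compares a filter constant directly against the raw min/max of a row group, without wrapping either side in a generic value object. All names (`NumericRange`, `Comparison`, `PruneResult`, `CheckZonemap`) are hypothetical and are not the API touched by this PR. The sketch also assumes the column contains no NULLs.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical comparison kinds for a filter of the form `column <op> constant`.
enum class Comparison { Equal, LessThan, GreaterThan };

// Hypothetical outcome of a zonemap check for one row group.
enum class PruneResult { AlwaysTrue, AlwaysFalse, Unknown };

// Hypothetical min/max statistics stored as raw numeric values.
template <class T>
struct NumericRange {
    T min;
    T max;
};

// Decides whether a row group can be skipped (AlwaysFalse), fully accepted
// (AlwaysTrue), or must be scanned (Unknown), using only raw comparisons.
// Assumption: the column has no NULLs.
template <class T>
PruneResult CheckZonemap(const NumericRange<T> &stats, Comparison cmp, T constant) {
    assert(stats.min <= stats.max);
    switch (cmp) {
    case Comparison::Equal:
        if (constant < stats.min || constant > stats.max) {
            return PruneResult::AlwaysFalse;
        }
        if (stats.min == stats.max) {
            return PruneResult::AlwaysTrue;  // every value equals the constant
        }
        return PruneResult::Unknown;
    case Comparison::LessThan:  // column < constant
        if (stats.max < constant) {
            return PruneResult::AlwaysTrue;
        }
        if (stats.min >= constant) {
            return PruneResult::AlwaysFalse;
        }
        return PruneResult::Unknown;
    case Comparison::GreaterThan:  // column > constant
        if (stats.min > constant) {
            return PruneResult::AlwaysTrue;
        }
        if (stats.max <= constant) {
            return PruneResult::AlwaysFalse;
        }
        return PruneResult::Unknown;
    }
    return PruneResult::Unknown;
}
```

When a table has thousands of row groups and most of them are skipped, a check like this runs once per row group, so keeping it to a few plain comparisons matters more than it would for a single call.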