Skip to content

ART prefix refactor #7930

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 53 commits into from
Jun 29, 2023
Merged

Conversation

taniabogatsch
Copy link
Contributor

This PR refactors the prefix handling of our ART index. Additionally, it fully reworks the iterator code and increases the code coverage of the ART.

Before this PR, there were seven ART nodes. Most of these had an inlined Prefix. That prefix contained a count of 4 bytes, and either 8 bytes of prefix data, or a Node pointer of 8 bytes, so in total 12 bytes.

With this PR, we introduce a new prefix node. This prefix node contains data and a node pointer to a subsequent node. The data consists of Node::PREFIX_SIZE + 1 bytes, of which the first byte is the count of that prefix node, and the other bytes are the prefix data. For prefixes exceeding Node::PREFIX_SIZE, the prefix down to a non-prefix node becomes a chain of prefix nodes, otherwise the prefix is stored in a single prefix node. If there is no prefix, then we directly traverse to the subsequent node.

uint8_t data[Node::PREFIX_SIZE + 1];
Node ptr;

This change has two main benefits.

  • In the absence of a prefix, we do not allocate any memory for it.
  • It enables us to rework the leaves in a future PR, which take up the majority of the memory of the ART. Due to this PR, leaves no longer contain a prefix. If a leaf only contains one row ID, we can now inline that row ID directly into the Node pointer of the previous node, completely omitting the need for a separate Leaf node. [ART] RowID Leaf node specialization #5365

This PR slightly improves the memory consumption for longer keys and does not regress in the presence of shorter keys. The ART still has a fairly high memory pressure, and the goal is to further decrease this in future PRs.

Workload Current Prefix Refactor
10M mostly unique 64bit integers 429.1MB 409.2MB
1M 64bit integers with many duplicates 16.5MB 16.7MB
10M non-inlined strings 1.0GB 866.6MB

Also related to #5865.

taniabogatsch and others added 30 commits May 4, 2023 18:56
# Conflicts:
#	src/execution/index/art/leaf.cpp
#	src/execution/index/art/leaf_segment.cpp
#	src/execution/index/art/node.cpp
#	src/execution/index/art/prefix.cpp
#	src/include/duckdb/execution/index/art/fixed_size_allocator.hpp
#	src/include/duckdb/execution/index/art/node.hpp
# Conflicts:
#	test/sql/copy/parquet/parquet_glob.test
# Conflicts:
#	extension/json/json_functions.cpp
Copy link
Contributor

@pdet pdet left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This looks very good. I just had a small comment about one of the functions.
In general, the refactored code for the iterator looks much better.
One thing I wonder is if this has any impact on performance, I understand that with this change, our ART is potentially smaller since it won't allocate prefix memory if it's not necessary, but I guess it will increase pointer chasing, especially if every node has a prefix before it.

@taniabogatsch
Copy link
Contributor Author

That's a fair point! To investigate this possible overhead, I can add some benchmarks for Lookup times, maybe during constraint checking. I should also make some existing benchmarks more complex (less sorted :D), especially during insertions and deletions.

# Conflicts:
#	test/api/test_reset.cpp
@taniabogatsch
Copy link
Contributor Author

I will open a separate PR (once the CI passes locally) to refactor the testing and benchmarking of the ART. That also adds a benchmark for Lookup performance to ensure we don't regress.

@taniabogatsch taniabogatsch marked this pull request as draft June 21, 2023 12:00
# Conflicts:
#	test/sql/storage/test_index_checkpoint.test
# Conflicts:
#	.github/config/uncovered_files.csv
@Mytherin Mytherin marked this pull request as ready for review June 25, 2023 08:33
@github-actions github-actions bot marked this pull request as draft June 26, 2023 13:12
@taniabogatsch taniabogatsch marked this pull request as ready for review June 27, 2023 09:14
@taniabogatsch taniabogatsch requested a review from pdet June 27, 2023 09:14
@taniabogatsch
Copy link
Contributor Author

@pdet I added more tests and a Lookup performance benchmark in #8055.

@github-actions github-actions bot marked this pull request as draft June 27, 2023 12:09
@taniabogatsch taniabogatsch marked this pull request as ready for review June 27, 2023 12:10
@github-actions github-actions bot marked this pull request as draft June 27, 2023 12:16
@taniabogatsch taniabogatsch marked this pull request as ready for review June 27, 2023 12:17
@github-actions github-actions bot marked this pull request as draft June 27, 2023 19:19
@taniabogatsch taniabogatsch marked this pull request as ready for review June 27, 2023 19:23
@taniabogatsch
Copy link
Contributor Author

@pdet local benchmark comparison, since the Lookup benchmark is still on feature. So there is no regression; the changes to the prefix increase lookup-performance slightly.

benchmark feature PR
lookup 1.806496 1.709860

Copy link
Contributor

@pdet pdet left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for running the benchmarks and adding as regression Tania!

It looks great!

@Mytherin Mytherin merged commit 188de57 into duckdb:feature Jun 29, 2023
@Mytherin
Copy link
Collaborator

Thanks! Looks great

@taniabogatsch taniabogatsch deleted the art-prefix-refactor branch June 29, 2023 13:10
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants