bt: Implementing binary tree backed by a vector. #1080

Closed
wants to merge 5 commits into from

Contributor

@armfazh armfazh commented Jun 14, 2024

Related to #947.

This PR uses a vector of nodes to represent an append-only binary tree.

For example, this is the representation of a tree with three nodes:

flowchart TB
    A --> B
    A --> C

In-memory storage:

| NodeRef | Node | Left | Right |
|---------|------|------|-------|
| 0       | A    | 1    | 2     |
| 1       | B    | None | None  |
| 2       | C    | None | None  |

Encoded representation: nodes are serialized in pre-order.

The table of nodes starts with NODES_CAPACITY=256 nodes pre-allocated to avoid frequent reallocations.
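
To make the layout above concrete, here is a minimal sketch of a node table backed by a `Vec`. It is not the code from this PR: the type names, the `push` helper, and `three_node_example` are assumptions for illustration; only `NODES_CAPACITY=256` and the three-node table come from the description.

```rust
/// A node stored in the table; children are indices into the same table.
/// (Field names are assumed for illustration.)
#[derive(Debug)]
struct Node<V> {
    value: V,
    left: Option<usize>,
    right: Option<usize>,
}

/// Append-only binary tree whose nodes all live in one `Vec`.
#[derive(Debug)]
struct BinaryTree<V> {
    nodes: Vec<Node<V>>,
}

/// Initial capacity of the node table, as stated in the description.
const NODES_CAPACITY: usize = 256;

impl<V> BinaryTree<V> {
    /// Creates an empty tree with the node table pre-allocated.
    fn new() -> Self {
        Self {
            nodes: Vec::with_capacity(NODES_CAPACITY),
        }
    }

    /// Appends a node and returns its index. Nodes are never removed,
    /// so an index stays valid for as long as the tree is in memory.
    fn push(&mut self, value: V) -> usize {
        self.nodes.push(Node {
            value,
            left: None,
            right: None,
        });
        self.nodes.len() - 1
    }
}

/// Builds the three-node example: A at index 0 with children B (1) and C (2).
fn three_node_example() -> BinaryTree<char> {
    let mut tree = BinaryTree::new();
    let a = tree.push('A');
    let b = tree.push('B');
    let c = tree.push('C');
    tree.nodes[a].left = Some(b);
    tree.nodes[a].right = Some(c);
    tree
}
```
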

Benchmark Comparison

Code for the benchmark comparison: armfazh#3

Task: Insert NUM_PATHS=1000 random paths of length N.

Old: Uses the canonical pointer-based data structure (main).
New: Uses a table (Vec) to store nodes (this PR).

Root: Traverses the tree from the root node.
Node: Traverses the tree from the last node inserted.

See new timings below in the thread of comments.

@armfazh armfazh requested a review from a team as a code owner June 14, 2024 22:43
src/bt.rs Outdated
}

/// Gets a mutable reference to the node located at the end of the path.
///
/// This function traverses the tree from this node (`self`) until reaching the
/// This function traverses the tree from the root node until reaching the
/// node at the end of the path. It returns [None], if the node is
/// unreachable or nonexistent. Otherwise, it returns a mutable reference
/// to the node.
pub fn get_node(&mut self, path: &Path) -> Option<&mut Node<V>> {
Member

Q. What is the intended use of this function? It will return a mutable reference to a Node, but all of Node's fields are private and Node has no publicly-accessible functionality other than an Encode implementation.

Contributor

There used to be a Node::insert(), among others. The idea was that lookups and insertions starting from an intermediate node could skip the first part of tree traversal by retaining a reference to a node. I think this would have to be split out to a new type, like NodeRef<'a, V>, to work with the new arena-based approach without running afoul of borrowck.

Contributor Author

Now I am using NodeRef, which stores the index of the node in the tree. A NodeRef is returned every time a node is inserted. In addition, the insert_at method takes a node reference, so during insertion the traversal can start from a node other than the root.
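
A rough sketch of how an index-based `NodeRef` and `insert_at` could fit together is shown below. The signature, the `&[bool]` path encoding (`false` = left, `true` = right), and the `with_root` constructor are assumptions for illustration, not the PR's actual API.

```rust
/// Handle to a node: just its index in the node table.
#[derive(Clone, Copy, Debug, PartialEq, Eq)]
struct NodeRef(usize);

impl NodeRef {
    /// The root always lives at index 0.
    const ROOT: NodeRef = NodeRef(0);
}

struct Node<V> {
    value: V,
    left: Option<usize>,
    right: Option<usize>,
}

struct BinaryTree<V> {
    nodes: Vec<Node<V>>,
}

impl<V> BinaryTree<V> {
    /// Hypothetical constructor: a tree that already has a root.
    fn with_root(value: V) -> Self {
        Self {
            nodes: vec![Node { value, left: None, right: None }],
        }
    }

    /// Inserts `value` at the end of `path`, walking from `start` instead of
    /// the root. Returns `None` if `start` is invalid, the path is empty,
    /// an intermediate node is missing, or the target slot is occupied.
    fn insert_at(&mut self, start: NodeRef, path: &[bool], value: V) -> Option<NodeRef> {
        let (&last, prefix) = path.split_last()?;
        let mut current = start.0;
        if current >= self.nodes.len() {
            return None;
        }
        // Walk every step except the last; each one must already exist.
        for &go_right in prefix {
            let node = &self.nodes[current];
            current = if go_right { node.right? } else { node.left? };
        }
        // The last step names the link that will point to the new node.
        let new_index = self.nodes.len();
        let parent = &mut self.nodes[current];
        let slot = if last { &mut parent.right } else { &mut parent.left };
        if slot.is_some() {
            return None;
        }
        *slot = Some(new_index);
        self.nodes.push(Node { value, left: None, right: None });
        Some(NodeRef(new_index))
    }
}
```

Because a `NodeRef` is a plain index rather than a borrow, the caller can keep it across later insertions without holding a reference into the tree.
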

src/bt.rs Outdated
/// node at the end of the path. It returns [None], if the node is
/// unreachable or nonexistent. Otherwise, it returns a reference to the
/// value stored in the node.
pub fn get(&self, path: &Path) -> Option<&V> {
let mut node = self;
let mut node = self.root;
Member

nit: this tree-traversal code is more-or-less duplicated several times (insert, get, and get_node); can we factor this out to a single implementation that we call into at each location that needs to traverse a tree?

Contributor

I expect we will need two traversal routines, one with & references and one with &mut references.

Contributor Author

I factored out the traversal; let me know what you think. Because of mutability, we still need to (almost) duplicate the code.
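
To illustrate why the traversal tends to come in two near-identical copies, here is a sketch using a pointer-based node. The `Node` layout and the `&[bool]` path encoding (`false` = left, `true` = right) are assumptions, not the PR's code. The two bodies differ only in `&` versus `&mut`, and Rust has no way to be generic over that difference, so each needs its own function.

```rust
/// Pointer-based node, assumed for illustration.
struct Node<V> {
    value: V,
    left: Option<Box<Node<V>>>,
    right: Option<Box<Node<V>>>,
}

impl<V> Node<V> {
    /// Shared traversal: follows `path` from this node, returns the value found.
    fn get(&self, path: &[bool]) -> Option<&V> {
        let mut node = self;
        for &go_right in path {
            let child = if go_right { &node.right } else { &node.left };
            node = child.as_deref()?;
        }
        Some(&node.value)
    }

    /// Mutable traversal: the same walk, but every borrow must be `&mut`.
    fn get_mut(&mut self, path: &[bool]) -> Option<&mut V> {
        let mut node = self;
        for &go_right in path {
            let child = if go_right { &mut node.right } else { &mut node.left };
            node = child.as_deref_mut()?;
        }
        Some(&mut node.value)
    }
}
```
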

src/bt.rs Outdated
value: V::decode(bytes)?,
left: Option::<usize>::decode(bytes)?,
right: Option::<usize>::decode(bytes)?,
})
}
}

impl<V: Encode> Encode for BinaryTree<V> {
Contributor

FWIW, note that this will be the first type whose encoded form is not uniquely determined by its semantic meaning (since the order of nodes and the offsets within them are determined by the in-memory layout). I think this is fine, since this serialized output won't be used in any sort of commitment, etc., and will just be deserialized again later by the same party.

Contributor Author

Serialization is now based on a pre-order traversal, which guarantees a specific order.
Note that NodeRefs are only valid while the tree is in memory.
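
As a sketch of why a pre-order walk removes the dependence on in-memory layout, consider the following. It only collects values in visit order and does not reproduce the PR's actual wire format; the type and method names are assumed. Two tables that store the same tree at different indices yield the same sequence.

```rust
struct Node<V> {
    value: V,
    left: Option<usize>,
    right: Option<usize>,
}

struct BinaryTree<V> {
    nodes: Vec<Node<V>>,
}

impl<V> BinaryTree<V> {
    /// Visits values in pre-order (node, left subtree, right subtree),
    /// starting from the root at index 0. An empty table yields nothing.
    fn preorder(&self) -> Vec<&V> {
        let mut out = Vec::with_capacity(self.nodes.len());
        let mut stack: Vec<usize> = Vec::new();
        if !self.nodes.is_empty() {
            stack.push(0);
        }
        while let Some(index) = stack.pop() {
            let node = &self.nodes[index];
            out.push(&node.value);
            // Push the right child first so the left child is popped first.
            if let Some(right) = node.right {
                stack.push(right);
            }
            if let Some(left) = node.left {
                stack.push(left);
            }
        }
        out
    }
}
```
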

enum Ref<'a, V> {
This(&'a mut Node<V>),
Other(&'a mut Option<Box<Node<V>>>),
pub fn insert_at(
Member

nit: insert_at & insert are nearly identical, except that insert can insert into an empty tree while insert_at(NodeRef::ROOT, /* pseudocode */ Path::EMPTY, value) will fail. Can we factor insert_at such that insert's body is just a call to insert_at(NodeRef::ROOT, path, value)?

To make this change, I think we'd need to special-case the root-insertion case in one or two places, but that's no worse than special-casing it in insert IMO.

Contributor Author

this has been refactored now

src/bt.rs Outdated
self.root.as_mut().and_then(|node| node.get_node(path))
/// Checks whether the node reference is valid with respect to the tree.
fn is_valid_node_ref(&self, node_ref: NodeRef) -> bool {
!self.nodes.is_empty() && node_ref.0 < self.nodes.len()
Member

nit: I think the !self.nodes.is_empty() check is unnecessary: the node_ref.0 is a usize and therefore nonnegative, so if nodes is empty, node_ref.0 < self.nodes.len() will be equivalent to node_ref.0 < 0, which is always false.
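
The equivalence can be checked with a small sketch. The two method names below are hypothetical, and the table is simplified to a `Vec<V>`. When `nodes` is empty, `len()` is 0 and no `usize` is smaller than 0, so both versions return `false`.

```rust
#[derive(Clone, Copy)]
struct NodeRef(usize);

struct BinaryTree<V> {
    nodes: Vec<V>,
}

impl<V> BinaryTree<V> {
    /// The check as written in the diff above.
    fn is_valid_original(&self, node_ref: NodeRef) -> bool {
        !self.nodes.is_empty() && node_ref.0 < self.nodes.len()
    }

    /// The simplified check: the emptiness test is already implied.
    fn is_valid_simplified(&self, node_ref: NodeRef) -> bool {
        node_ref.0 < self.nodes.len()
    }
}
```
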

@armfazh armfazh requested a review from branlwyd June 26, 2024 16:20
@armfazh
Contributor Author

armfazh commented Jun 26, 2024

folks, this is ready for another round of review.

Collaborator

@cjpatton cjpatton left a comment

This looks good to me, but before we land, we should make sure that this is actually faster. Please add some benchmarks and report the performance change in a comment.

Comment on lines +163 to +164
if !(self.is_valid_node_ref(node_ref)
|| (self.nodes.is_empty() && node_ref == NodeRef::ROOT))
Collaborator

nit: I wonder if the second term in this || should be implied by the first? I.e., if the list of nodes is empty, and the reference is to the root, then it's a valid reference, correct?

Put simply: consider moving (self.nodes.is_empty() && node_ref == NodeRef::ROOT) into is_valid_node_ref().

}

Ok(())
*node = Some(new_index);
self.nodes.push(Node::new(value));
Collaborator

I'm probably just misreading the code, but I would expect that, somewhere in this function, we'd need to update the parent of the node we just inserted so that their node reference points to the node we just pushed. What am I missing?

Contributor

Yes, node is a mutable reference to the Option<usize> from the parent node's left or right link. The name node may be a bit misleading.
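
A stripped-down sketch of that linking step is below, with the variable renamed to `slot` to make its role clearer. The free function `attach_child` and its signature are assumptions for illustration. Writing through `slot` is what updates the parent; the new index is computed before the push so the link and the new node agree.

```rust
struct Node<V> {
    value: V,
    left: Option<usize>,
    right: Option<usize>,
}

/// Appends a child under `parent` and links it in.
/// Returns `None` if `parent` is out of range or the link is already taken.
fn attach_child<V>(
    nodes: &mut Vec<Node<V>>,
    parent: usize,
    go_right: bool,
    value: V,
) -> Option<usize> {
    // Index the new node will occupy once it is pushed.
    let new_index = nodes.len();
    let parent_node = nodes.get_mut(parent)?;
    // `slot` is the parent's own left or right link, borrowed mutably.
    let slot: &mut Option<usize> = if go_right {
        &mut parent_node.right
    } else {
        &mut parent_node.left
    };
    if slot.is_some() {
        return None;
    }
    // This assignment is the parent update.
    *slot = Some(new_index);
    // The borrow of the parent has ended, so the table can grow now.
    nodes.push(Node { value, left: None, right: None });
    Some(new_index)
}
```
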

Member

@branlwyd branlwyd left a comment

Code LGTM; I agree with cjpatton that a benchmark is warranted.

@simon-friedberger
Contributor

I think this code deserves a comment on why a generic binary tree from a crate wouldn't do or why this shouldn't be a separate crate. It looks like a fairly generic interface to have a binary tree without node removal.

@armfazh
Contributor Author

armfazh commented Jul 10, 2024

I updated the description of the PR (top). Now it includes timings that compare the current implementation with the implementation in this PR.

The savings are more significant for long paths (i.e. when the tree's height is large). Insertion in place also offers significant savings.

The code for the benchmark is here.

@armfazh armfazh requested a review from cjpatton July 10, 2024 18:33
@cjpatton
Collaborator

The savings are more significant for long paths (i.e. when the tree's height is large). Insertion in place also offers significant savings.

That doesn't seem to be corroborated by the timings you posted:

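N=8

| Insertion at | Old       | New       |
|--------------|-----------|-----------|
| Root         | 199.47 µs | 185.05 µs |
| Node         | 235.31 µs | 180.04 µs |

N=64

| Insertion at | Old       | New       |
|--------------|-----------|-----------|
| Root         | 5.6078 ms | 5.3975 ms |
| Node         | 2.8733 ms | 1.0842 ms |

N=256

| Insertion at | Old       | New       |
|--------------|-----------|-----------|
| Root         | 66.765 ms | 93.256 ms |
| Node         | 11.680 ms | 4.219 ms  |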

From this table it looks like inserting from the root is significantly slower for the new code.

@branlwyd
Member

It appears that insert-from-root only becomes slower once N becomes large enough. If N continues to increase, does the performance difference become larger?

@armfazh
Contributor Author

armfazh commented Jul 12, 2024

I revisited the benchmarks and updated the bench script to compute the task correctly.

As @branlwyd pointed out, the time for inserting values consistently increases as the length of paths grows.

With these new measurements, it's clear that the new version (using std::vec::Vec) is no better than the current code.
One reason is that the expression self.nodes[node] is bounds-checked at runtime (unlike in C). I tried the unsafe get_unchecked method, but it only gives 15% savings, which is still not close to the current code.
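
For context, the difference between the two kinds of indexing looks roughly like this. The functions below are an illustrative sketch over a bare table of `left` links, not the benchmark code, and their names are made up. The unchecked version is only sound if every stored index is known to be in range.

```rust
/// Follows links from `start` until a `None` link, using checked indexing.
/// `links[current]` panics if `current` is out of range.
fn walk_checked(links: &[Option<usize>], start: usize) -> usize {
    let mut steps = 0;
    let mut current = start;
    while let Some(next) = links[current] {
        current = next;
        steps += 1;
    }
    steps
}

/// Same walk, skipping the bounds check on every step.
///
/// # Safety
/// `start` and every index stored in `links` must be less than `links.len()`.
unsafe fn walk_unchecked(links: &[Option<usize>], start: usize) -> usize {
    let mut steps = 0;
    let mut current = start;
    // SAFETY: the caller guarantees that all indices are in range.
    while let Some(next) = unsafe { *links.get_unchecked(current) } {
        current = next;
        steps += 1;
    }
    steps
}
```
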

N is length of path.

Inserting at Root

| N   | Old (Tree) | New (Vector) |
|-----|------------|--------------|
| 16  | 293.27 µs  | 353.73 µs    |
| 64  | 3.56 ms    | 5.21 ms      |
| 128 | 14.46 ms   | 22.05 ms     |
| 256 | 58.81 ms   | 91.52 ms     |
| 512 | 239.45 ms  | 374.53 ms    |

Inserting at Node

| N   | Old (Tree) | New (Vector) |
|-----|------------|--------------|
| 16  | 172.80 µs  | 212.11 µs    |
| 64  | 686.39 µs  | 853.18 µs    |
| 128 | 1.3668 ms  | 1.6735 ms    |
| 256 | 2.7321 ms  | 3.3498 ms    |
| 512 | 5.4535 ms  | 6.6705 ms    |

@cjpatton
Collaborator

Alrighty, this seems pretty definitive! @armfazh let's close this PR. Thanks for putting in the time to investigate this.

@cjpatton cjpatton closed this Jul 16, 2024