-
Notifications
You must be signed in to change notification settings - Fork 30
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
bt: Implementing binary tree backed by a vector. #1080
Conversation
src/bt.rs
Outdated
} | ||
|
||
/// Gets a mutable reference to the node located at the end of the path. | ||
/// | ||
/// This function traverses the tree from this node (`self`) until reaching the | ||
/// This function traverses the tree from the root node until reaching the | ||
/// node at the end of the path. It returns [None], if the node is | ||
/// unreachable or nonexistent. Otherwise, it returns a mutable reference | ||
/// to the node. | ||
pub fn get_node(&mut self, path: &Path) -> Option<&mut Node<V>> { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Q. What is the intended use of this function? It will return a mutable reference to a Node
, but all of Node
's fields are private and Node
has no publicly-accessible functionality other than an Encode
implementation.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There used to be a Node::insert()
, among others. The idea was that lookups and insertions starting from an intermediate node could skip the first part of tree traversal by retaining a reference to a node. I think this would have to be split out to a new type, like NodeRef<'a, V>
, to work with the new arena-based approach without running afoul of borrowck.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Now, I am using NodeRef
that store the index of the node in the tree. This reference is returned every time a node is inserted. In addition, the insert_at
method can take a node reference so during insertion, the traversal can start from other node different than the root.
src/bt.rs
Outdated
/// node at the end of the path. It returns [None], if the node is | ||
/// unreachable or nonexistent. Otherwise, it returns a reference to the | ||
/// value stored in the node. | ||
pub fn get(&self, path: &Path) -> Option<&V> { | ||
let mut node = self; | ||
let mut node = self.root; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit: this tree-traversal code is more-or-less duplicated several times (insert
, get
, and get_node
); can we factor this out to a single implementation that we call into at each location that needs to traverse a tree?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I expect we will need two traversal routines, one with & references and one with &mut references.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I factored out the traversal, let me know what you think. Also due to mutability we need to (almost) duplicate code.
src/bt.rs
Outdated
value: V::decode(bytes)?, | ||
left: Option::<usize>::decode(bytes)?, | ||
right: Option::<usize>::decode(bytes)?, | ||
}) | ||
} | ||
} | ||
|
||
impl<V: Encode> Encode for BinaryTree<V> { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
FWIW, note that this will be the first type for which its encoded form is not uniquely determined by its semantic meaning. (since the order of nodes and the offsets within them are determined by the in-memory layout) I think this is fine, since this serialized output won't be used in any sorts of commitments, etc., and will just be deserialized again later by the same party.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
now serialization is based on a pre-order traversal, which guarantee a specific order.
Note that NodeRef are valid while the tree is in memory.
src/bt.rs
Outdated
} | ||
|
||
/// Gets a mutable reference to the node located at the end of the path. | ||
/// | ||
/// This function traverses the tree from this node (`self`) until reaching the | ||
/// This function traverses the tree from the root node until reaching the | ||
/// node at the end of the path. It returns [None], if the node is | ||
/// unreachable or nonexistent. Otherwise, it returns a mutable reference | ||
/// to the node. | ||
pub fn get_node(&mut self, path: &Path) -> Option<&mut Node<V>> { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There used to be a Node::insert()
, among others. The idea was that lookups and insertions starting from an intermediate node could skip the first part of tree traversal by retaining a reference to a node. I think this would have to be split out to a new type, like NodeRef<'a, V>
, to work with the new arena-based approach without running afoul of borrowck.
src/bt.rs
Outdated
/// node at the end of the path. It returns [None], if the node is | ||
/// unreachable or nonexistent. Otherwise, it returns a reference to the | ||
/// value stored in the node. | ||
pub fn get(&self, path: &Path) -> Option<&V> { | ||
let mut node = self; | ||
let mut node = self.root; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I expect we will need two traversal routines, one with & references and one with &mut references.
enum Ref<'a, V> { | ||
This(&'a mut Node<V>), | ||
Other(&'a mut Option<Box<Node<V>>>), | ||
pub fn insert_at( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit: insert_at
& insert
are nearly identical, except that insert
can insert into an empty tree while insert_at(NodeRef::ROOT, /* pseudocode */ Path::EMPTY, value)
will fail. Can we factor insert_at
such that insert
's body is just a call to insert_at(NodeRef::ROOT, path, value)
?
To make this change, I think we'd need to special-case the root-insertion case in one or two places, but that's no worse than special-casing it in insert
IMO.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
this has been refactored now
src/bt.rs
Outdated
self.root.as_mut().and_then(|node| node.get_node(path)) | ||
/// Checks whether the node reference is valid with respect to the tree. | ||
fn is_valid_node_ref(&self, node_ref: NodeRef) -> bool { | ||
!self.nodes.is_empty() && node_ref.0 < self.nodes.len() |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit: I think the !self.nodes.is_empty()
check is unnecessary: the node_ref.0
is a usize
and therefore nonnegative, so if nodes
is empty, node_ref.0 < self.nodes.len()
will be equivalent to node_ref.0 < 0
, which is always false.
folks, this is ready for another round of review. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This looks good to me, but before we land, we should make sure that this is actually faster. Please add some benchmarks and report the performance change in a comment.
if !(self.is_valid_node_ref(node_ref) | ||
|| (self.nodes.is_empty() && node_ref == NodeRef::ROOT)) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit: I wonder if the second term in this ||
should be implied by the first? I.e., if the list of nodes is empty, and the refernece is to the root, then it's a valid reference, correct?
Put simply: consider moving (self.nodes.is_empty() && node_ref == NodeRef::ROOT)
into is_valid_node_ref()
.
} | ||
|
||
Ok(()) | ||
*node = Some(new_index); | ||
self.nodes.push(Node::new(value)); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm probably just misreading the code, but I would expect that, somewhere in this function, we'd need to update the parent of the node we just inserted so that their node reference points to the node we just pushed. What am I missing?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, node
is a mutable reference to the Option<usize>
from the parent node's left
or right
link. The name node
may be a bit misleading.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Code LGTM; I agree with cjpatton that a benchmark is warranted.
I think this code deserves a comment on why a generic binary tree from a crate wouldn't do or why this shouldn't be a separate crate. It looks like a fairly generic interface to have a binary tree without node removal. |
I updated the description of the PR (top). Now it includes timings that compare the current implementation with the implementation in this PR. The savings are more significant when using large paths (i.e. large tree's height). Also, insertion in-place offers significant savings too. The code for benchmark is here. |
That doesn't seem to be corroborated by the timings you posted:
From this table it looks like inserting from the root is significantly slower for the new code. |
It appears that insert-from-root only becomes slower once |
I revisited the benchmarks and updated the bench script to compute the task correctly. As @branlwyd pointed out, the time for inserting values consistently increases as the length of paths grows. With these new measurements, it's clear that the new version (using N is length of path. Inserting at Root
Inserting at Node
|
Alrighty, this seems pretty definitive! @armfazh let's close this PR. Thanks for putting in the time to investigate this. |
Related to #947.
Using a vector of nodes to represent an append-only binary tree.
E.g. This is the representation of a tree with three nodes.
In-memory storage:
Encoded representation: Nodes are serialized in a pre-order ordering.
The table of nodes starts with
NODES_CAPACITY=256
nodes pre-allocated in order to reduce frequent allocations.Benchmark Comparison
Code for the benchmark comparison: armfazh#3
Task: Insert
NUM_PATHS=1000
random paths of length N.Old: Uses canonical pointers data structure (main).
New: Uses a table (
Vec
) to store nodes (this PR).Root: Traverses the tree from the root node.
Node: Traverses the tree from the last node inserted.
See new timings below in the thread of comments.