Feature/sorted sets #523
If two nodes overlap, then they are equal. Otherwise, compare them based on the flake interval they contain.
Use persistent sorted set's efficient slices so we don't have to consider all of a node's children (linear time) and we can instead limit it to only the children we care about in logarithmic time.
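A rough, language-neutral sketch of the two ideas above, written in Python rather than the project's Clojure and not taken from the patch. Closed integer intervals `(first, rhs)` stand in for index nodes, `compare_nodes` treats overlapping intervals as equal, and `slice_children` (a hypothetical name) finds the children covering a range with two binary searches instead of a linear scan. The probe nodes, whose two endpoints are both set to the lookup key, are an assumption made for illustration.

```python
def compare_nodes(a, b):
    """Compare two (first, rhs) intervals; overlapping intervals are equal."""
    if a[1] < b[0]:
        return -1  # a lies entirely before b
    if a[0] > b[1]:
        return 1   # a lies entirely after b
    return 0       # the intervals overlap


def _lower_bound(children, probe):
    """Index of the first child that is not strictly before the probe."""
    lo, hi = 0, len(children)
    while lo < hi:
        mid = (lo + hi) // 2
        if compare_nodes(children[mid], probe) < 0:
            lo = mid + 1
        else:
            hi = mid
    return lo


def _upper_bound(children, probe):
    """Index of the first child that is strictly after the probe."""
    lo, hi = 0, len(children)
    while lo < hi:
        mid = (lo + hi) // 2
        if compare_nodes(children[mid], probe) <= 0:
            lo = mid + 1
        else:
            hi = mid
    return lo


def slice_children(children, start, end):
    """Children overlapping [start, end], assuming `children` is sorted
    and its intervals do not overlap each other."""
    # Hypothetical probe nodes: both endpoints set to the lookup key.
    first = _lower_bound(children, (start, start))
    last = _upper_bound(children, (end, end))
    return children[first:last]
```

For example, with children `[(0, 9), (10, 19), (20, 29), (30, 39)]`, a lookup for the range 12 to 25 only touches the two middle children.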
Converted to draft because I've found some issues with queries using the file connection and loaded dbs.
@fluree/core I'm still characterizing the indexing and querying performance on large ledgers, but this patch is ready for review.
The changes look good to me, thanks for making your commits so focused - it made reviewing much simpler. Since this is a large change I want to give others some chances to review and also wait until you've got a feel for any perf regressions.
Fixes fluree/core#18
Overall LGTM. I appreciate the cleanup, and I'm happy to see the `tree-chan` api get a little simpler as a bonus. 😄
I know you're still measuring performance implications, so I will also hold off on a formal 👍 for now, but wanted to comment so it's clear I did get a chance to review.
Unfortunately, persistent-sorted-set shows at least a 2x slowdown compared to clojure.data.avl when accessing the results of sorted set subscans. pss returns a lazy seq while avl returns a fully realized sorted set, and we pay the penalty whenever we consume the entire seq, which we always do. Because of this performance regression, I'm closing this without merging.
This patch replaces clojure.data.avl with persistent-sorted-set as the sorted set implementation backing the 5 core indexes.
A number of changes had to be made to the internal indexing apis to make them compatible with persistent-sorted-set:
- […] `:first-flake` and `:rhs` attributes are each set to the specified endpoint. The index node comparator defined above will return the correct range using these two nodes.
- […] `subrange` api. Changes have been made here to ensure that no set operations are ever performed directly on slices.
- […] `>` or `<` order checks in its `slice` api, opting instead to only implicitly support `>=` and `<=`. This means there could be two index nodes which share an endpoint (the `:rhs` of one is equal to the `:first-flake` of the next). I don't think this will be an issue in practice, because it should only matter when integrating novelty into existing nodes, and all the nodes in novelty should have at least a different `t` value than the flakes in any of the existing nodes, so we shouldn't run into nodes that actually share endpoints. The other place this might matter is while re-balancing leaf nodes when the overflow limit is hit, but this patch includes code to explicitly drop overlapping flakes after a split operation.

Some of the behavior of the internal apis has also changed to make use of the efficiencies of persistent-sorted-set. Now, we must pass a collection to `sorted-set-by` instead of individual flakes, avoiding the need for `apply`. Also, `sorted-set-by` now expects the supplied collection of flakes to already be sorted, because this kind of set creation is heavily optimized in persistent-sorted-set. All of the flake collections we had were already sorted anyway, so this shouldn't prove to be an issue.

Besides those changes, I took the time to simplify the api of `tree-chan` to cut down on the transducers we have to supply to it. Instead of always needing to limit the flakes returned between starting endpoints, that is now done automatically by passing the endpoints in as arguments. Also, there's no need for an `include?` function to pass in as an argument, since that can be accomplished with a filtering transducer.

Lastly, I removed a lot of unused and unnecessary code, and moved the logback-test.xml into a new test-resources directory so it wouldn't be active under the development profile.
I might be forgetting some other changes here because it's a lot.
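To illustrate the shape of the `tree-chan` simplification described above, here is a minimal Python sketch. It is not the actual Clojure api: `scan_range`, `only_even`, and their parameters are hypothetical. The range limit comes from the endpoint arguments, and any extra inclusion test is expressed as an optional filtering step, with a plain generator function standing in for a transducer. Integers stand in for flakes.

```python
from bisect import bisect_left, bisect_right


def scan_range(flakes, start, end, xform=None):
    """Return the items of the sorted list `flakes` within [start, end],
    optionally passed through `xform` (a stand-in for a transducer)."""
    # The endpoints are arguments, so callers no longer trim the range.
    lo = bisect_left(flakes, start)
    hi = bisect_right(flakes, end)
    selected = flakes[lo:hi]
    if xform is None:
        return selected
    return list(xform(selected))


def only_even(items):
    """Example filtering step, replacing a separate include?-style argument."""
    return (x for x in items if x % 2 == 0)
```

Calling `scan_range(list(range(10)), 3, 8, only_even)` limits the scan to the range first and then applies the filter, so the caller supplies one optional transformation rather than both a range-limiting step and an inclusion predicate.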