Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Feature/sorted sets #523

Closed
wants to merge 62 commits into from
Closed

Feature/sorted sets #523

wants to merge 62 commits into from

Conversation

zonotope
Copy link
Contributor

@zonotope zonotope commented Jul 5, 2023

This patch replaces clojure.data.avl with persistent-sorted-set as the sorted set implementation backing the 5 core indexes.

A number of changes had to be made to the internal indexing apis to make them compatible with persistent-sorted-set:

  • persistent-sorted-set does not support sorted maps, so our index branch nodes now use a sorted set instead of a sorted map to organize the branch's children.
    • There is a new index node comparator derived from our standard flake comparators to compare branch nodes. The node comparator treats nodes whose flake ranges overlap as equal, and disjoint branch nodes are compared by comparing their branch intervals.
    • In order to find a range of a branch node's children between two flakes, we now construct two dummy "nodes" out of those flakes, one fore each flake endpoint, whose :first-flake and :rhs attributes are each set to the specified endpoint. The index node comparator defined above will return the correct range using these two nodes.
  • persistent-sorted-set supports efficiently creating sequential subranges called slices which are not themselves sorted sets, unlike clojure.data.avl which directly creates subsets from its subrange api. Changes have been made here to ensure that no set operations were ever directly performed on slices.
  • persistent-supported-set does not support > or < order checks in its slice api, opting instead to only implicitly support >= and <=. This could mean that there could be two index nodes which share and endpoint (the :rhs of one is equal to the :first-flake of the next). I don't think this will be an issue in practices because this should only matter when integrating novelty into existing nodes, and all the nodes in novelty should have at least a different t value than the flakes in any of the existing nodes, so we shouldn't run into nodes that actually share endpoints in practice. The other place this might matter is while re-balancing leaf nodes when the overflow limit is hit, but this patch includes code to explicitly drop overlapping flakes after a split operation.

Some of the behavior of the internal apis also have changed to make use of the efficiencies of persistent sorted set. Now, we must pass a collection to sorted-set-by instead of individual flakes avoiding the need for apply. Also, sorted-set-by now expects the supplied collection of flakes to already be sorted because this kind of set creation is heavily optimized in persistent-sorted-set. All of the flake collections we had were already sorted anyway, so this shouldn't prove to be an issue.

Besides those changes, I took the time to simplify the api of tree-chan to cut down on the transducers we have to supply to it. Instead of always needing to limit the flakes returned between starting endpoints, that is now done automatically by passing the endpoints in as arguments. Also, there's no need for an include? function to pass in as an argument, since that can be accomplished with a filtering transducer.

Lastly, I removed a lot of unused and unnecessary code, and moved the logback-test.xml into a new test-resources directory so it wouldn't be active under the development profile.

I might be forgetting some other changes here because it's a lot.

If two nodes overlap, then they are equal. otherwise, compare them based on the
flake interval they contain.
Use persistent sorted set's efficient slices so we don't have to consider all of
a node's children (linear time) and we can instead limit it to only the children
we care about in logarithmic time.
@zonotope zonotope requested a review from a team July 5, 2023 22:05
@zonotope zonotope self-assigned this Jul 5, 2023
@zonotope zonotope marked this pull request as draft July 6, 2023 14:56
@zonotope
Copy link
Contributor Author

zonotope commented Jul 6, 2023

converted to draft because i've found some issues with queries using the file connection and loaded dbs

@zonotope zonotope marked this pull request as ready for review July 13, 2023 18:59
@zonotope
Copy link
Contributor Author

@fluree/core I'm still characterizing the indexing and querying performance on large ledgers, but this patch is ready for review

@dpetran
Copy link
Contributor

dpetran commented Jul 14, 2023

The changes look good to me, thanks for making your commits so focused - it made reviewing much simpler. Since this is a large change I want to give others some chances to review and also wait until you've got a feel for any perf regressions.

@zonotope
Copy link
Contributor Author

Fixes fluree/core#18

Copy link
Contributor

@mpoffald mpoffald left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Overall LGTM. I appreciate the cleanup, and I'm happy to see the tree-chan api get a little simpler as a bonus. 😄

I know you're still measuring performance implications, so I will also hold off on a formal 👍 for now, but wanted to comment so it's clear I did get a chance to review.

@zonotope zonotope mentioned this pull request Jul 21, 2023
@zonotope
Copy link
Contributor Author

Unfortunately, persistent-sorted-set shows at least a 2x slowdon over clojure.data.avl when accessing the results of sorted set subscans. pss returns a lazy seq while avl returns a fully realized sorted set, and we pay the penalty whenever we consume the entire seq, which we always will do.

Because of this performance regression, I'm closing this without merging.

@zonotope zonotope closed this Jul 22, 2023
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
beta release Required for Beta/Community Release
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

3 participants